About this role

Great Value Hiring is seeking experienced Site Reliability Engineers to enhance their production incident response and infrastructure reliability. The role focuses on evaluating and training AI models related to system failures and operational best practices.

Responsibilities:

Evaluate and train AI models that aim to reason about system failures, observability, and operational best practices

Requirements:

Have 3+ years of experience in SRE, DevOps, or production engineering at a big tech company or leading startup
Have served in on-call rotations managing Tier 1/Tier 2 production services with meaningful SLA requirements
Have hands-on experience with incident response and post-mortem processes, including structured RCA (root cause analysis)
Are proficient with observability stacks: Prometheus, Grafana, Datadog, PagerDuty, or equivalent
Have deep knowledge of Linux systems, networking (TCP/IP, DNS, load balancing), and container orchestration (Kubernetes, Docker)
Have experience with infrastructure-as-code (Terraform, Pulumi, CloudFormation) and CI/CD pipelines
Have strong debugging skills across the stack, from application-level tracing to kernel-level diagnostics
Are currently based in the United States
