Great Value Hiring is seeking experienced Site Reliability Engineers with a background in production incident response and infrastructure reliability. The role focuses on evaluating and training AI models that reason about system failures and operational best practices.
Responsibilities:
- Evaluate and train AI models that aim to reason about system failures, observability, and operational best practices
Requirements:
- Have 3+ years of experience in SRE, DevOps, or production engineering at a big tech company or a leading startup
- Have served in on-call rotations managing Tier 1/Tier 2 production services with meaningful SLA requirements
- Have hands-on experience with incident response and post-mortem processes, including structured RCA (root cause analysis)
- Are proficient with observability stacks: Prometheus, Grafana, Datadog, PagerDuty, or equivalent
- Have deep knowledge of Linux systems, networking (TCP/IP, DNS, load balancing), and container orchestration (Kubernetes, Docker)
- Have experience with infrastructure-as-code (Terraform, Pulumi, CloudFormation) and CI/CD pipelines
- Have strong debugging skills across the stack, from application-level tracing to kernel-level diagnostics
- Are currently based in the United States