Own SLOs, SLIs, and error budgets for all production services
Drive error budget discipline across engineering
Design reliability patterns for AI agent pipelines
Lead incident response and post-incident reviews that produce durable fixes
Serve as the primary reliability liaison to Software and AI Engineering
Own CI/CD pipeline strategy
Own the service catalog
Operate Datadog as the single pane of glass for service health
Requirements
Bachelor's degree in Computer Science, Engineering, or a related field, or an equivalent combination of education and experience
6–8 years of progressive experience in site reliability engineering, platform engineering, or DevOps, with demonstrated technical leadership at the senior individual contributor level
Deep expertise in AWS (EKS, Lambda, CloudWatch, AWS Config) and multi-region architecture patterns
Proficiency with Terraform and GitOps; experience with policy-as-code (Sentinel, OPA/Rego, or equivalent)