Define and drive reliability strategy across control-plane and data-plane systems, including multi-region resilience, business continuity and disaster recovery (BCDR), and failover design
Establish and operationalize SLOs, SLAs, and error budgets, ensuring they inform planning and engineering tradeoffs
Lead initiatives that measurably improve MTTR, incident prevention, and overall service health
Own incident management end-to-end, driving systemic fixes and long-term reliability improvements beyond immediate response
Lead architecture and design reviews to ensure systems meet scalability, reliability, and cost efficiency goals
Champion automation and modernization, including AI-driven reliability improvements
Establish and enforce code quality and review standards
Lead cross-functional initiatives and align engineering with product priorities
Mentor senior engineers and act as a technical leader across teams

6+ years leading delivery of complex distributed systems or SaaS platforms
Strong experience with multi-region, split-plane architectures (control-plane / data-plane)
Proven track record of improving SLOs, MTTR, and system reliability at scale
Proficiency in at least one language such as Python, Java, C++, or JavaScript
Deep experience with:
Kubernetes (multi-cluster), CI/CD, and GitOps (Argo CD)
SLO/SLA design, observability, and incident management
Infrastructure as Code and cloud platforms
Disaster recovery, resilience, and security best practices
Strong leadership skills with experience mentoring senior engineers and influencing cross-team decisions
Nice to Have
Experience with chaos engineering and large-scale reliability automation
Background in enterprise SaaS platforms or split-plane architectures
Expertise in navigating, understanding, and leveraging modern observability platforms (Datadog, Grafana, etc.)

Lead Site Reliability Engineer

Key skills