Run the daily operations of the SRE practice: team planning, shift assignments, escalation routing, and workload balancing.
Maintain a healthy on-call program: define rotation rules, track fatigue, ensure coverage, and continuously improve response maturity.
Oversee incident management processes—ensuring consistent triage, high-quality postmortems, and follow-through on remediation work.
Establish operational KPIs for the team (MTTA, MTTR, on-call load, ticket aging, toil reduction) and hold the team accountable for meeting them.
Coach and develop SREs at all levels through 1:1s, technical guidance, and structured growth plans.
Ensure the team’s processes, documentation, and runbooks stay current and are audited.
Provide architecture-level guidance on resilience, observability, and reliability patterns; step in directly when the team is blocked or customer-impacting work demands senior technical judgment.
Validate SLIs/SLOs and error budgets across services; ensure consistent implementation and reporting.
Review and approve reliability design work—monitoring strategies, automation initiatives, CI/CD changes, deployment safety controls, and cloud cost/performance optimizations.
Participate in high-severity incidents as the escalation point and technical lead when needed.
Ensure engineering quality for IaC, CI/CD, observability instrumentation, and Kubernetes platform operations.
Act as the primary point of contact for internal stakeholders (Dev, Product, Architecture, Cloud) regarding reliability strategy and prioritization.
Translate business priorities into reliability roadmaps, staffing plans, and operational improvements.
Align teams around shared reliability objectives—ensuring corrective actions, automation priorities, and capacity planning are actually executed.
Support customer-facing conversations when reliability posture, operational processes, or technical improvements require leadership representation.
Requirements
6–10 years in SRE/Operations/Platform roles, with at least 2 years leading or managing engineers.
Hands-on technical background across cloud platforms (AWS/Azure/GCP) and Kubernetes.
Experience defining and operating SLIs/SLOs, incident response, and postmortem programs.
Strong grounding in Terraform or similar IaC, CI/CD systems, and observability technologies (Prometheus, Grafana, OpenTelemetry, ELK).
Ability to assess technical work, coach engineers through complex problems, and make informed trade-offs under pressure.
Excellent operational judgment: triage, prioritization, team load balancing, and process design.