Provide strategic leadership and oversight for four SRE teams, setting clear direction, priorities, and expectations aligned to business and engineering objectives
Lead, mentor, and develop SRE managers and senior engineers, fostering a culture of accountability, operational ownership, innovation, and psychological safety
Define and own the SRE and Platform Engineering strategy and roadmap, ensuring alignment with cloud transformation initiatives and long-term organizational goals
Serve as a key voice in architectural and platform decisions, influencing designs with a focus on scalability, reliability, automation, and operational efficiency
Partner with executive leadership to communicate reliability posture, risks, and investment needs in clear business terms
Establish and continuously evolve SRE principles and best practices, including SLIs, SLOs, error budgets, toil management, and reliability-driven prioritization
Provide technical direction and governance across GCP (preferred) and AWS environments, ensuring consistent reliability and operational patterns
Drive the evolution of Platform Engineering, enabling self-service infrastructure and service delivery with built-in guardrails for application teams
Own strategy and standards for Infrastructure-as-Code (IaC) and automation, leveraging tools such as Terraform or equivalent frameworks across cloud environments
Ensure observability excellence through metrics, logging, tracing, alerting, and proactive capacity and performance management
Provide executive leadership during large-scale or high-impact incidents, ensuring effective coordination, escalation, and stakeholder communication
Define, refine, and scale incident management and on-call practices, emphasizing resilience, sustainability, and rapid recovery
Champion blameless postmortems, ensuring root causes are addressed and lessons learned are translated into systemic improvements
Partner with Security and Compliance teams to ensure systems meet security, privacy, and regulatory requirements without compromising reliability
Own and report on reliability metrics, operational KPIs, and service health for leadership and executive stakeholders
Drive continuous improvement through reliability reviews, retrospectives, and data-driven decision-making
Balance reliability, velocity, and cost across platforms, applying error budgets and capacity planning to guide trade-offs
Requirements
10+ years of experience in SRE, infrastructure, platform, or systems engineering roles, with 5+ years leading managers and senior technical teams
Direct, hands-on experience in Site Reliability Engineering, including operating production systems at scale
Strong experience with Google Cloud Platform (GCP) or equivalent public cloud (AWS or Azure), including distributed, cloud-native architectures
Proven expertise in Infrastructure-as-Code (IaC) and automation frameworks (e.g., Terraform or similar)
Deep understanding of observability ecosystems (metrics, logging, tracing), CI/CD pipelines, and DevOps/SRE tooling
Ability to communicate complex technical concepts clearly to both technical and non-technical stakeholders, influencing at all levels of the organization
Tech Stack
AWS
Azure
Cloud
Google Cloud Platform
Terraform
Benefits
Competitive total rewards (base salary + bonus, if applicable)
Customizable benefits package (3 medical plans with Health Savings Account company match)
Generous paid time off for non-exempt team members, starting with 3 weeks + 13 paid holidays, including 2 personal floating holidays
Flexible time off for exempt team members + 13 paid holidays