Own the reliability and operational health of enterprise operational efficiency platforms, with a mix of end-to-end ownership and shared operational responsibility
Lead and develop a geographically dispersed team across North America and Europe, including managing an on-call rotation
Establish and evolve a standard operational model for change management and incident response across platforms
Drive operational rigor through strong observability practices, including metrics, alerting, and insight into platform health
Lead response to major incidents, ensuring clear communication, effective coordination, root cause identification, and durable remediation
Act as the primary operational point of contact for SaaS platform vendors, holding providers accountable for reliability, incident response, and service commitments
Communicate platform health, risks, and tradeoffs in business-relevant terms to functional partners and leadership
Document operational standards as context for AI tools, and use AI to improve reliability practices
Requirements
Bachelor's degree in engineering or information systems
10+ years of experience in a similar role
Experience managing and developing a global, distributed reliability team
Strong understanding of observability, incident management, and operational standard methodologies
Experience crafting or enforcing change and deployment processes that balance speed with stability
Demonstrated ability to manage vendor relationships, including setting expectations, reviewing performance, and driving accountability during incidents or service degradation
Familiarity with using AI effectively, through context curation and documentation, to achieve high velocity and quality in execution