Lead a team of SREs (up to ~15) and create a culture of continuous improvement, learning, and engineering excellence.
Work closely with application teams during their migrations to the cloud.
Work closely with Product Owners and Engineering Leads to balance new feature delivery with reliability, performance and system health.
Use data, observability tooling and SRE principles to detect issues early, improve system performance, and reduce operational toil.
Lead and mature incident and problem management practices, ensuring strong root‑cause analysis, learning, and measurable improvement: longer MTTF and shorter MTTR.
Champion error budgets, SLOs, and reliability‑first thinking across your aligned Cloud Labs.
Influence platform direction and engineering standards, helping shape how we build resilient cloud services at scale.

Strong cloud engineering background — ideally across GCP and Azure — with experience designing or operating large‑scale, resilient cloud platforms.
Deep understanding of observability tooling (metrics, logs, traces) and how to drive reliability improvements using data.
Hands‑on experience with modern SRE practices: SLOs and SLIs, error budgets, reducing toil through automation, production readiness, and post‑mortem best practice.
Experience leading engineering teams and fostering an inclusive, high‑performing culture.
Ability to navigate complex stakeholder groups and communicate technical topics in a clear, accessible way.

Lead Cloud Site Reliability Engineer

Key skills