Lead a team of SREs (up to ~15) and create a culture of continuous improvement, learning, and engineering excellence.
Work closely with application teams as they migrate their applications to the Cloud.
Work closely with Product Owners and Engineering Leads to balance new feature delivery with reliability, performance, and system health.
Use data, observability tooling, and SRE principles to detect issues early, improve system performance, and reduce operational toil.
Lead and mature incident and problem management practices, ensuring rigorous root‑cause analysis, shared learning, increased MTTF, and reduced MTTR.
Champion error budgets, SLOs, and reliability‑first thinking across your aligned Cloud Labs.
Influence platform direction and engineering standards, helping shape how we build resilient cloud services at scale.
Requirements
Strong cloud engineering background (ideally across GCP and Azure), with experience designing or operating large‑scale, resilient cloud platforms.
Deep understanding of observability tooling (metrics, logs, traces) and how to drive reliability improvements using data.
Hands‑on experience with modern SRE practices: SLOs/SLIs, error budgets, reducing toil through automation, production readiness, and post‑mortem best practice.
Experience leading engineering teams and fostering an inclusive, high‑performing culture.
Ability to navigate complex stakeholder groups and communicate technical topics in a clear, accessible way.
Tech Stack
Azure
Cloud
Google Cloud Platform
Benefits
A generous pension contribution of up to 15%
An annual performance-related bonus
Share schemes including free shares
Benefits you can adapt to your lifestyle, such as discounted shopping
30 days’ holiday, with bank holidays on top
A range of wellbeing initiatives and generous parental leave policies