Lead the team responsible for reliability across Akamai's AI compute and platform services
Build the team, owning hiring strategy, candidate evaluation, and interview coordination for AI SRE roles
Partner with product engineering teams to embed reliability into the development lifecycle
Define and implement SRE practices for Akamai's AI compute and platform services
Ensure operational readiness for AI products by establishing quality gates, on-call rotations, runbooks, and escalation paths for AI infrastructure failure mode
Scale operations through software and automation, reducing toil and driving the team toward programmatic solutions over manual intervention
Own incident management integration for AI workloads, including post-incident analysis and driving systemic improvements that prevent recurrence
Requirements
Have extensive experience in SRE, infrastructure, or platform engineering, with an expertise in leading SRE teams
Track record of building SRE teams and practices, ideally in an environment where SRE was new or being established
Demonstrate expertise in SLOs/SLIs, observability tools, and large-scale incident management while ensuring operational efficiency.
Demonstrate expertise with Kubernetes and containerization in large-scale environments.
Demonstrate expertise in Python or Go automation and tooling, while possessing knowledge of Linux systems and networking fundamentals.
Manage CI/CD pipelines, implement deployment safety measures, and utilize infrastructure-as-code tools like Terraform or similar alternatives.
Build relationships with product engineering teams while effectively communicating SRE value in terms relevant to engineering partners.