Defining the reliability architecture for Akamai's AI compute and platform services, including SLO frameworks, fault tolerance patterns, and capacity planning models
Hands-on building of automation and tooling that reduces operational toil and scales the SRE team's impact
Designing observability strategy by leveraging Akamai's existing platform to build the telemetry, dashboards, alerts, and GPU-specific monitoring needed for AI workloads
Architecting deployment safety practices including progressive rollouts, canary analysis, rollback automation, and change safety processes
Influencing product engineering architecture and design decisions, embedding reliability into the development lifecycle at the system level
Mentoring and elevating other SREs through design reviews, code reviews, and hands-on problem-solving, setting the technical bar for the team
Requirements
Have extensive experience in SRE, platform engineering, and/or infrastructure engineering, with demonstrated impact at a principal or staff level
Demonstrate deep Kubernetes expertise, including autoscaling, resource scheduling, and container orchestration for compute-intensive workloads
Demonstrate expertise in programming with Python and/or Go, coupled with experience creating production-grade automation, tooling, and platform services
Have a track record of influencing cross-team technical decisions, mentoring engineers, elevating technical standards, and collaborating effectively with product engineering teams
Have experience with AI/ML infrastructure, model deployment, or GPU workloads
Be able to design reliability into new platforms at the system level and build influence with product engineering teams through technical depth