Provide leadership and mentorship to a team of 8-10 Site Reliability Engineers (SREs).
Define, measure, and report on key Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure adherence to the 99.99%+ uptime Service Level Agreement (SLA).
Collaborate effectively with other SRE, Security, and Development teams.
Define and implement processes to ensure the team efficiently meets target deadlines.
Drive the successful completion of large-scale projects, coordinating with multiple Development teams.
Conduct thorough capacity analysis and planning.
Effectively manage and scale infrastructure by establishing and adhering to automation standards.
Analyze and resolve complex system behavior, performance, and application issues.
Oversee comprehensive observability and analysis across multiple datacenters.
Requirements
Minimum of six years of experience leading a software-focused Site Reliability Engineering (SRE) team of eight to ten staff.
Demonstrated experience working within organizations operating on a global scale.
Proven ability to drive strategic decisions regarding "build vs. buy" technology choices.
Proficiency in developing, maintaining, and administering modern infrastructure tooling, with a strong emphasis on Infrastructure as Code (IaC) principles.
Experience provisioning public cloud resources using IaC tools such as CloudFormation or Terraform.
Solid knowledge of scripting and programming languages (e.g., Python, Ruby, Bash, Go) and their associated coding standards.
Experience with Docker and container orchestration platforms (e.g., Kubernetes).
Practical experience using Git in a large-scale team environment.
Understanding and application of security design principles.
Experience operating within a high-volume or mission-critical production service environment.
Expertise in IP networking, including familiarity with network functionality, operational procedures, and failure modes.