Manage and mentor a team of site reliability engineers, setting performance objectives, providing technical guidance, and ensuring alignment with business goals.
Oversee the execution of reliability initiatives, ensuring critical systems maintain high availability, resilience, and performance at scale.
Work with engineering, operations, and product teams to integrate reliability best practices into development, deployment, and operational processes.
Lead incident management activities, including coordinating response efforts, conducting root cause analysis, and implementing solutions to prevent future incidents.
Define and track key performance indicators (KPIs) related to system reliability, availability, and performance, reporting results to leadership regularly.
Promote and drive automation within the site reliability engineering team, ensuring processes are streamlined and systems operate with minimal manual intervention.
Manage capacity planning efforts, ensuring the scalability of systems and the ability to handle increasing traffic and resource demands effectively.
Ensure disaster recovery plans and procedures are developed and tested to minimize downtime in the event of a failure.
Lead career development and mentorship efforts for team members, ensuring engineers have the tools and opportunities to grow their skills and advance their careers.
Requirements
8+ years of relevant experience and a Bachelor’s degree, or an equivalent combination of education and experience.
Experience leading others.
Bachelor’s degree in Computer Science, Information Technology, or a related field; Master’s preferred.
8+ years of experience in infrastructure management, with at least 3 years in a leadership role.
Extensive experience with multiple cloud platforms (AWS, Azure, GCP) and on-premises infrastructure management.
Demonstrated experience building or scaling AI/ML-based automation for operations, including AIOps platforms, alert noise reduction, auto-remediation, and intelligent runbooks.
Strong background in incident management, ITIL frameworks, and operational best practices.
Experience with monitoring tools, automation platforms, and infrastructure-as-code technologies.
Tech Stack
AWS
Azure
Cloud
Google Cloud Platform
Benefits
Generous paid time off
Healthcare coverage for you and your family
Resources to create financial security and support your mental health