Yahoo is a technology company that connects brands and partners with a vast audience. Yahoo is seeking a strategic Senior Engineering Manager to lead its Tooling & Reliability Platforms team. The role focuses on building modern, efficient platforms and managing the engineers responsible for incident management and reliability tooling.
Responsibilities:
- Manage and grow a high-performing team
- Identify and implement AI-driven efficiencies in the product lifecycle to accelerate platform delivery and engineering productivity
- Treat the reliability stack as a product
- Define the roadmap for the Incident Management platform, ensuring these tools reduce cognitive load for hundreds of service teams by replacing manual investigation steps with AI-assisted workflows
- Drive the integration of GenAI and SRE Agents into production environments
- Establish frameworks for validating AI-generated incident summaries and root-cause hypotheses to ensure accuracy and guard against hallucinations
- Define the vision for the next generation of Resilience Engineering, focusing on building services that make products inherently resilient through automated alert diagnostics and self-healing systems
- Act as a high-leverage partner to key vendors, holding them accountable for roadmap delivery and ensuring their features align with the team's vision
Requirements:
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience
- 5+ years of experience leading SRE or DevOps teams in a high-scale, cloud-native environment
- Strong background in software engineering (Python, Go, or Java) and Infrastructure-as-Code
- Deep familiarity with incident management and AIOps tools (e.g., Rootly, PagerDuty, BigPanda)
- Experience evaluating and refining AI-generated outputs in a technical or operational context
- Proven ability to collaborate with SaaS partners to shape a shared product vision
- Comfort operating in an evolving, AI-augmented environment with a focus on continuous learning
- Experience with BCP/DR planning or Chaos Engineering
- Previous experience implementing large-scale AIOps or self-healing infrastructure initiatives