Netflix is a leading entertainment company focused on pushing the boundaries of storytelling and technology. They are seeking a highly motivated Ads Reliability Engineer to ensure the reliability, resilience, and scalability of their Ads systems, directly impacting revenue and user experience.
Responsibilities:
- Design, implement, and maintain scalable and reliable infrastructure to support the Netflix Ads Suite
- Collaborate with engineering and product teams to integrate observability, reliability, and security considerations into the entire software development lifecycle
- Coordinate capacity planning as we scale up Dynamic Ad Insertion for global-scale Netflix Live streaming
- Develop and implement automation tools for monitoring, deployment, and incident response to ensure efficient and reliable operations
- Participate in on-call rotations to ensure the 24/7 health of the Netflix Ads Suite and contribute to incident response, diagnosis, and resolution
- Implement and maintain a robust incident response framework, including blame-aware incident reviews to learn from operational surprises
- Proactively identify sources of instability in distributed systems and analyze how complex systems fail from a reliability and resilience perspective
- Champion and embed a culture of reliability across the Ads organization. You will act as a force multiplier, scaling your technical expertise by creating clear documentation, developing best-practice guides, and building tooling to automatically roll out reliability enhancements
Requirements:
- 5+ years of experience as a Site Reliability Engineer (SRE), Production Engineer, or similar role supporting business-critical, high-traffic services
- Write code to solve problems. You are proficient in one or more languages like Python, Go, or Java and believe in automating solutions over manual effort
- Are fluent in modern cloud infrastructure. You have hands-on experience with cloud providers such as AWS/Azure/GCP, Infrastructure as Code such as Terraform, and container orchestration systems like Kubernetes
- Understand large-scale distributed systems, including their common failure modes and edge cases
- Thrive on collaboration and influence. You have excellent communication skills and a proven ability to build relationships with and educate engineering partners
- Are a natural troubleshooter. You can calmly navigate complex production issues, identify root causes, and implement effective, lasting solutions
- Possess a growth mindset. You are relentlessly curious, committed to continuous improvement, and passionate about scaling your expertise
- Direct experience with ad tech platforms, including real-time bidding (RTB), demand-side platforms (DSPs), and supply-side platforms (SSPs)
- Direct experience with Dynamic Ad Insertion for Live Events
- Direct experience with systems subject to large, sudden load spikes
- Experience with large-scale data pipelines or analytics platforms
- A track record of contributing to open-source projects in the reliability space