Netflix is a leading entertainment company that is dedicated to pushing the boundaries of storytelling and technology. They are seeking a Senior Site Reliability Engineer to enhance the reliability and operational excellence of their streaming services, ensuring a seamless experience for their members. The role involves collaborating with engineering teams, designing resilient infrastructure, and implementing reliability metrics to support Netflix's high-quality service delivery.

Responsibilities:

Design and evolve resilient infrastructure for Netflix member-facing services, ensuring our systems are scalable, fault-tolerant, and operable at a global scale
Take a data-driven approach to reliability to identify and address systemic risk
Partner with engineering and product teams to embed reliability and observability into the full software development lifecycle—from design and readiness reviews through rollout and ongoing operations
Define and measure Service Level Objectives (SLOs) and other reliability metrics that matter to the member experience, using them to guide capacity planning, operational priorities, and tradeoffs between reliability, feature velocity, and cost
Build and improve automated processes for deployment, monitoring, capacity management, and incident response to ensure our operations are fast, reliable, and repeatable
Participate in on-call rotations for critical Streaming services, helping ensure 24/7 availability and a great member experience
Lead and contribute to incident response—from triage and mitigation through follow-ups—focusing on learning, systemic fixes, and avoiding repeat issues
Proactively identify and reduce sources of instability in distributed systems by analyzing how our systems actually fail in production and driving architectural or operational improvements
Champion a culture of reliability across business domains, acting as a force multiplier: creating clear documentation, developing best-practice guides, and building tooling that enables other teams to adopt reliability improvements at scale

Requirements:

5+ years of experience in an SRE, Production Engineering, or similar role operating business-critical, high-traffic services in production
Strong coding skills in one or more languages such as Python, Go, or Java, with a focus on automating solutions instead of relying on manual operations
Fluency in modern cloud infrastructure: hands-on experience with large-scale environments on AWS/Azure/GCP, along with abstracted compute and platform orchestration systems
Deep understanding of large-scale distributed systems, including common failure modes, performance bottlenecks, and how to design for resilience and graceful degradation
Track record of proactively identifying reliability risks and gaps through metrics, incidents, architecture reviews, or resilience testing—and implementing pragmatic, scalable solutions to mitigate them
Strong observability and performance tuning skills: you can use metrics, logs, and traces to debug issues in complex systems, and you're comfortable profiling and optimizing services to meet latency, availability, or efficiency goals
Experience with incident management and response: you can navigate ambiguous, high-pressure production issues, drive coordinated response, and follow through with durable improvements
Strong collaboration and influence skills: you communicate clearly, build trust with partner teams, and can guide engineering teams toward better reliability practices without relying on authority
Ability to balance reliability, velocity, and cost: you're comfortable making and explaining tradeoffs, and using data (SLOs, error budgets, performance metrics) to guide decision-making
Growth mindset and curiosity: you are eager to learn, comfortable challenging assumptions (including your own), and motivated by continuous improvement of systems, processes, and yourself
Embraces agency: given a loosely defined goal, you thrive by defining the work needed to reach it while farming for dissent and feedback from the team and our stakeholders

Senior Site Reliability Engineer, CORE (Member Experience / Resilience Operations)
