Netflix is a leading entertainment company that pushes the boundaries of storytelling and technology. They are seeking a Software Engineer L5 to lead the automation and validation of their edge platform, ensuring the reliability and performance of their content delivery systems.
Responsibilities:
- Own and build scalable testing infrastructure and end-to-end automated validation for edge appliances, covering functional, resiliency, performance, and upgrade/rollback testing, with high reliability, strong observability, and clear release gates; this includes tests that validate that the platform meets scaling and performance requirements under production-like workloads
- Improve failure triage with AI-assisted tooling that reduces time-to-detection and time-to-resolution
- Lead and mentor engineers building and maintaining test automation and release qualification
- Partner with OS, security, hardware, and application teams to ensure validation keeps pace with rapid product development
- Debug complex regressions across hardware/firmware/OS boundaries and collaborate cross-functionally to drive fixes to resolution
- Build dashboards and alerting for regression detection, performance drift, and release readiness
Requirements:
- 10+ years of software engineering experience (or equivalent depth), including ownership of CI/CD systems and architecting large-scale test automation
- Strong coding ability in Python, Rust, and/or Go, with comfort writing shell scripts
- Deep hands-on experience with Linux and/or FreeBSD in systems contexts (boot, networking, and storage)
- Strong ability to design, build, and operate cloud services that support CI/CD and test automation, including maintaining service reliability, scalability, observability, and cost efficiency
- Experience designing automated test frameworks for reliability, performance, hardware-in-the-loop, and integration testing
- Proven ability to provide technical leadership across teams through setting standards, mentoring, and owning roadmaps
- Experience with modern CI systems and build and release pipelines, such as GitHub Actions, Jenkins, or similar tools
- Strong debugging skills across distributed systems and low-level systems boundaries using logs, metrics, tracing, and performance tooling
- Proficiency working on highly distributed systems
- Experience using AI tools for operational triage (log clustering, anomaly detection), with a pragmatic approach (guardrails, fallback paths, auditability)
- Experience with performance tooling: perf, flame graphs, bpftrace/eBPF, dtrace (FreeBSD), fio, and network benchmarking
- Experience contributing to and collaborating with relevant open-source communities
- Experience with hardware lab automation and fleet provisioning workflows such as PXE boot, imaging, remote power control, serial console access, and rack automation
- Experience with incident response practices including postmortems, root cause analysis, and driving preventative engineering actions
- Experience validating BIOS and firmware behavior, managing firmware rollouts, and working with hardware vendors on platform issues