NVIDIA is a leading technology company specializing in graphics processing units and AI. They are seeking a Senior Site Reliability Engineer to design, implement, and support the operational and reliability aspects of their large-scale Observability & Telemetry platform, ensuring maximum reliability and uptime for their GPU cloud services.
Responsibilities:
- Design, implement and support the operational and reliability aspects of a large-scale Observability & Telemetry collection platform, with a focus on performance at scale, real-time monitoring, logging and alerting
- Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation and refinement
- Support services before they go live through activities such as system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health
- Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
- Practice sustainable incident response and blameless postmortems
- Be part of an on-call rotation to support production systems
Requirements:
- BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience
- 8+ years of experience with infrastructure automation and distributed systems design, including designing and developing tools for running large-scale private or public cloud systems in production
- 5+ years of experience delivering foundational infrastructure and observability platforms
- Experience in one or more of the following: Python, Go, Perl or Ruby
- In-depth knowledge of Linux, networking and containers
- Interest in crafting, analyzing and fixing large-scale distributed systems
- Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive. Ability to debug and optimize code and automate routine tasks
- Experience using or running large private and public cloud systems based on Kubernetes, OpenStack and Docker. Experience running Grafana, OpenTelemetry, Prometheus, and similar observability-focused tools