NVIDIA is looking for a Senior Site Reliability Engineer (SRE) to join its GeForce NOW (GFN) team. The SRE keeps GPU cloud gaming services reliable and available, while enabling developers to make changes to the system through careful planning. Responsibilities include improving service observability, automating tasks, and supporting production systems.

Responsibilities:

Build tools to improve SRE observability
Contribute to the Kubernetes migration, including VMI setup and problem solving
Rapidly debug and triage incidents and user-reported issues
Take ownership of automation, scripting, and tooling, both new and existing, to help the team achieve 100% automation of daily tasks
Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity management, and launch reviews
Be part of an on-call rotation to support production systems

Requirements:

MS or BS in Computer Science/Engineering or a related field or equivalent experience
8+ years of site reliability engineering experience working on large-scale distributed microservices in a production environment, with a real passion for automation and tooling
Very strong Kubernetes background, including the ability to understand complex, highly available VMI setups on K8s
Experience leading significant production improvements, including change management, post-mortem reviews, and workflow processes, and designing and delivering software automation in various languages
Proven strengths in problem solving and root-causing issues, while continuously seeking ways to drive optimization, efficiency, and the bottom line
Previous experience with Datadog, Prometheus, Alertmanager, or similar monitoring systems
Experience managing multi-region cloud deployments on hyperscalers like AWS, GCP, or Azure
Experience designing and managing deployment pipelines using tools such as GitHub Actions, GitLab CI, or ArgoCD
Excellent communication, presentation, social, and analytical skills; the ability to communicate complex interaction concepts clearly and persuasively across different audiences and varying levels of the organization
Production-grade coding proficiency in languages like Go, Python, or robust Bash scripting
Production on-call experience is a must: candidates should have served in a primary production on-call rotation, responding to and mitigating high-severity infrastructure alerts and service degradations
Experience working with automated anomaly detection, log clustering tools, or LLM-assisted debugging platforms
Comfortable using AI on a day-to-day basis as an SRE
Prior experience as an SRE or Service Engineer is a huge plus
