About this role

NVIDIA, a leader in AI and high-performance computing, is seeking Senior Software Engineers to help build and operate large-scale GPU infrastructure for AI research and production workloads. The role involves developing automation, tooling, and operational systems that keep GPU clusters reliable, scalable, and safe to run.

Responsibilities:

Build and operate automation for large-scale GPU clusters across NVIDIA Cloud Partners (NCP) and on-prem environments
Develop tools and services for provisioning, validation, upgrades, monitoring, repair, and cluster lifecycle operations
Improve Day 0 / Day 1 / Day 2 workflows for cluster bring-up, handoff, and production operations
Reduce manual production touches through APIs, GitOps, automation, and agent-assisted workflows
Participate in on-call, incident response, debugging, and durable follow-up work
Partner with platform, storage, networking, security, and workload teams to make infrastructure production-ready

Requirements:

8+ years of experience building or operating production infrastructure
Strong programming skills in Python, Go, or similar
Experience with Linux, Kubernetes, containers, cloud infrastructure, or infrastructure automation
Ability to troubleshoot distributed systems in production
Clear communication and ability to work across teams
BS/MS in Computer Science or equivalent experience
Experience with GPU infrastructure, Kubernetes operators, GitOps, Terraform, ArgoCD, or fleet automation
Experience with SLOs, on-call, incident response, observability, and reliability practices
Exposure to BMaaS, VMaaS, managed Kubernetes, or multi-cloud infrastructure

Senior Software Engineer, DGX Cloud Production Engineering
