NVIDIA, a leader in AI and high-performance computing, is seeking Senior Software Engineers to help build and operate large-scale GPU infrastructure for AI research and production workloads. The role involves developing automation, tooling, and operational systems that keep GPU clusters reliable, scalable, and safe to run.
Responsibilities:
- Build and operate automation for large-scale GPU clusters across NVIDIA Cloud Partners (NCP) and on-prem environments
- Develop tools and services for provisioning, validation, upgrades, monitoring, repair, and cluster lifecycle operations
- Improve Day 0 / Day 1 / Day 2 workflows for cluster bring-up, handoff, and production operations
- Reduce manual production touches through APIs, GitOps, automation, and agent-assisted workflows
- Participate in on-call, incident response, debugging, and durable follow-up work
- Partner with platform, storage, networking, security, and workload teams to make infrastructure production-ready
Requirements:
- 8+ years of experience building or operating production infrastructure
- Strong programming skills in Python, Go, or similar
- Experience with Linux, Kubernetes, containers, cloud infrastructure, or infrastructure automation
- Ability to troubleshoot distributed systems in production
- Clear communication and ability to work across teams
- BS/MS in Computer Science or equivalent experience
- Experience with GPU infrastructure, Kubernetes operators, GitOps, Terraform, ArgoCD, or fleet automation
- Experience with SLOs, on-call, incident response, observability, and reliability practices
- Exposure to BMaaS, VMaaS, managed Kubernetes, or multi-cloud infrastructure