NVIDIA is at the forefront of innovation in Artificial Intelligence, High-Performance Computing, and Visualization. We are seeking a Senior ML Platform Engineer to architect, build, and scale high-performance ML infrastructure using Infrastructure-as-Code practices, while collaborating with researchers to streamline ML experimentation.
Responsibilities:
- Design, build, and maintain our core ML platform infrastructure as code, primarily using Ansible and Terraform, ensuring reproducibility and scalability across large-scale, distributed GPU clusters
- Apply SRE principles to diagnose, troubleshoot, and resolve complex system issues across the entire stack, ensuring high availability and performance for critical AI workloads
- Develop robust internal automation and tooling for ML workflow orchestration, resource scheduling, and platform operations, with a strong focus on software engineering best practices
- Collaborate with ML researchers and applied scientists to understand infrastructure needs and build solutions that streamline their end-to-end experimentation
- Evolve and operate our multi-cloud and hybrid (on-prem + cloud) environments, implementing monitoring, alerting, and incident response protocols
- Participate in on-call rotation to provide support for platform services and infrastructure running critical ML jobs, driving root cause analysis and implementing preventative measures
- Write high-quality, maintainable code (Python, Go) to contribute to the core orchestration platform and automate manual processes
- Drive the adoption of modern GPU technologies and ensure smooth integration of next-generation hardware (e.g., GB200, NVLink) into ML pipelines
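To give a flavor of the resource-scheduling and tooling work described above, here is a minimal, hypothetical sketch of a first-fit GPU job scheduler in Python. All names (nodes, jobs, GPU counts) are illustrative assumptions, not a description of NVIDIA's actual platform:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A GPU node in a hypothetical cluster inventory."""
    name: str
    free_gpus: int
    jobs: list = field(default_factory=list)

def schedule(jobs, nodes):
    """First-fit scheduler: place each (name, gpus_needed) job on the
    first node with enough free GPUs; return jobs that could not fit."""
    unplaced = []
    for job_name, gpus_needed in jobs:
        for node in nodes:
            if node.free_gpus >= gpus_needed:
                node.free_gpus -= gpus_needed
                node.jobs.append(job_name)
                break
        else:
            unplaced.append(job_name)
    return unplaced

# Illustrative inventory and queue.
nodes = [Node("node-a", free_gpus=8), Node("node-b", free_gpus=4)]
jobs = [("train-llm", 8), ("eval", 2), ("sweep", 4)]
leftover = schedule(jobs, nodes)
print(leftover)       # jobs that did not fit anywhere
print(nodes[0].jobs)  # jobs placed on node-a
```

Production schedulers (Slurm, Kubernetes with the device plugin) add preemption, priorities, and topology awareness, but the core bin-packing decision is the same shape as this sketch.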
Requirements:
- BS/MS in Computer Science, Engineering, or equivalent experience
- 5+ years in software/platform engineering or SRE roles, including 3+ years focused on ML infrastructure or distributed compute systems
- Strong proficiency in Infrastructure-as-Code (IaC) tools, specifically Ansible and Terraform, with a proven track record of building and managing production infrastructure
- SRE-oriented mindset with extensive experience in diagnosing system-level issues, performance tuning, and ensuring platform reliability
- Solid understanding of ML workflows and the ML lifecycle, from data preprocessing to deployment
- Proficiency in operating containerized workloads with Kubernetes and Docker
- Strong software engineering skills in languages such as Python or Go, with a focus on automation, tooling, and writing production-grade code
- Experience with Linux systems internals, networking, and performance tuning at scale
- Experience building or operating ML platforms supporting frameworks like PyTorch or TensorFlow at scale
- Deep understanding of distributed training techniques (e.g., data/model parallelism, Horovod, NCCL)
- Expertise with modern CI/CD methodologies and GitOps practices
- Passion for building developer-centric platforms with great UX and strong operational reliability
- Proven ability to contribute code to complex orchestration or automation platforms
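The distributed-training knowledge listed above (data parallelism, NCCL) can be sketched in plain Python: each worker computes a gradient on its own data shard, then an all-reduce averages the gradients so every worker applies the identical update. This is a toy simulation of what NCCL's all-reduce does across GPUs; the model (y = w*x), learning rate, and shards are illustrative assumptions:

```python
def local_gradient(w, shard):
    """Per-worker gradient of mean squared error for y = w*x
    over this worker's shard of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Simulated all-reduce: every worker ends up with the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

# Two workers, each holding a shard of data generated by y = 2x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = 0.0
for _ in range(50):  # synchronous SGD steps
    grads = [local_gradient(w, shard) for shard in shards]
    synced = all_reduce_mean(grads)   # same value on every worker
    w -= 0.05 * synced[0]             # identical update everywhere
print(round(w, 3))  # converges toward 2.0, since y = 2x
```

Frameworks like PyTorch DistributedDataParallel hide this loop behind hooks that launch NCCL all-reduces during the backward pass, but reasoning about correctness and performance still comes down to this pattern.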