NVIDIA is a pioneer in accelerated computing, known for inventing the GPU and driving breakthroughs in gaming, computer graphics, high-performance computing, and artificial intelligence. We are seeking a Software Engineer to join our MARS team at NVIDIA, where you will help design, build, and operate exascale infrastructure that powers AI research and development at unprecedented scale.
Responsibilities:
- Design, develop, and operate distributed systems that manage data, compute, and networking for large-scale AI workloads
- Build software and automation to orchestrate workloads across thousands of GPUs and petabytes of storage in multi-region clusters
- Collaborate with AI/ML research teams to understand their requirements and translate them into scalable, high-performance solutions
- Drive improvements in system reliability, performance, and observability to meet exascale standards
- Partner with security, networking, and platform teams to ensure that MARS infrastructure meets the highest standards of robustness and compliance
- Participate in design reviews, contribute to system architecture discussions, and influence the evolution of NVIDIA’s AI infrastructure stack
- Stay current with advances in distributed systems, large-scale computing, and AI frameworks to help shape the future direction of MARS
Requirements:
- BS or equivalent experience in Computer Science, Computer Engineering, or a related technical field
- 5+ years of experience developing and operating large-scale distributed systems, infrastructure platforms, or HPC environments
- Strong programming skills in C++, Python, or Go, with proven experience designing production-quality software systems
- Solid understanding of distributed systems principles, data management, and large-scale orchestration frameworks
- Hands-on experience with high-performance storage (e.g., Lustre, GPFS, BeeGFS) and compute scheduling and orchestration (e.g., Slurm, Kubernetes, LSF)
- Familiarity with cloud environments (Azure, AWS, GCP) and infrastructure automation tools
- Strong problem-solving skills, ownership mindset, and the ability to thrive in a fast-paced, collaborative environment
- Excellent communication skills and a track record of cross-functional collaboration
Preferred Qualifications:
- Graduate degree (MS/PhD or equivalent experience) in Computer Science, Distributed Systems, or a related field
- Expertise in large-scale data management, cluster scheduling, or workload orchestration at exascale
- Experience building or maintaining infrastructure for AI/ML research, including distributed training pipelines using PyTorch, JAX, or NeMo
- Familiarity with data security, compliance, and lifecycle management for research-scale datasets
- Demonstrated leadership in system architecture design, performance optimization, or reliability engineering