NVIDIA is at the forefront of the generative AI revolution, building the software and systems that power the world’s most advanced large language model workloads. The company is looking for a Software Engineer focused on the bring-up, triage, benchmarking, analysis, and optimization of distributed training and inference workloads across NVIDIA GPU platforms. The role involves debugging large-scale AI clusters, developing benchmarking tooling, and delivering data-driven recommendations based on profiling results.
Responsibilities:
- Bring up, validate, and debug large-scale AI clusters, infrastructure, and end-to-end workloads
- Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks
- Perform root-cause analysis of failures in large distributed environments
- Contribute to the resilience and failure-attribution tooling that detects, triages, and attributes node, fabric, and workload failures across the cluster
- Build and maintain repeatable benchmark suites, automation, acceptance criteria, and qualification workflows on new platforms
- Tune runtime settings, communication parameters, and deployment configurations in close partnership with framework, systems, and platform teams
- Deliver actionable, data-driven recommendations based on profiling, benchmark results, and cluster characterization
Requirements:
- Bachelor's or Master's in Computer Science or a related technical field (or equivalent experience)
- 3+ years of experience developing software for AI, HPC, or systems-level applications
- Hands-on experience with multi-GPU or multi-node workloads and CUDA-aware distributed execution
- Background in debugging and scaling distributed systems
- Experience debugging and triaging AI applications across the full stack, from the application level toward the hardware
- Experience operating workloads in scheduled, containerized cluster environments
- Excellent analytical, debugging, and communication skills, and a collaborative approach across teams
- Strong Python and C/C++ programming skills
- Hands-on experience with NCCL
- Deep familiarity with the RDMA software stack (NCCL, IB verbs, UCX, libfabric) and with InfiniBand / RoCE congestion debugging
- Experience building acceptance tests, benchmark harnesses, regression gates, or cluster qualification tooling for AI platforms, including MLPerf
- Experience diagnosing performance jitter
- Experience building resilience, fault-detection, or failure-attribution systems for datacenter-scale infrastructure