NVIDIA is at the forefront of the generative AI revolution, building the software and systems that power the world’s most advanced large language model workloads. The company is looking for a Software Engineer focused on bring-up, triage, benchmarking, analysis, and optimization of distributed training and inference workloads across NVIDIA GPU platforms. The role involves debugging large-scale AI clusters, developing benchmarking tooling, and delivering data-driven recommendations based on profiling results.

Responsibilities:

Bring up, validate, and debug large-scale AI clusters, infrastructure, and end-to-end workloads
Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks
Perform root-cause analysis of failures in large distributed environments
Contribute to the resilience and failure-attribution tooling that detects, triages, and attributes node, fabric, and workload failures across the cluster
Build and maintain repeatable benchmark suites, automation, acceptance criteria, and qualification workflows on new platforms
Tune runtime settings, communication parameters, and deployment configurations in close partnership with framework, systems, and platform teams
Deliver actionable, data-driven recommendations based on profiling, benchmark results, and cluster characterization

Requirements:

Bachelor's or Master's in Computer Science or a related technical field (or equivalent experience)
3+ years of experience developing software for AI, HPC, or systems-level applications
Hands-on experience with multi-GPU or multi-node workloads and CUDA-aware distributed execution
Background in debugging and scaling distributed systems
Experience debugging and triaging AI applications across the full stack, from the application level toward the hardware
Experience operating workloads in scheduled, containerized cluster environments
Excellent analytical, debugging, and communication skills, and a collaborative approach across teams
Strong Python and C/C++ programming skills
Hands-on experience with NCCL and CUDA-aware distributed execution
Deep familiarity with the RDMA software stack (NCCL, IB verbs, UCX, libfabric) and with InfiniBand / RoCE congestion debugging
Experience building acceptance tests, benchmark harnesses, regression gates, or cluster qualification tooling for AI platforms, including MLPerf
Experience diagnosing performance jitter
Experience building resilience, fault-detection, or failure-attribution systems for datacenter-scale infrastructure

Software Engineer, DGX Cloud AI Infrastructure
