NVIDIA is at the forefront of the generative AI revolution, building the software and systems that power the world’s most advanced large language model workloads. We are looking for a Senior Software Engineer to lead the bring-up, triage, benchmarking, analysis, and optimization of distributed training and inference workloads across NVIDIA GPU platforms at the largest scales we run.
Responsibilities:
- Lead bring-up, validation, and debugging of large-scale AI clusters, infrastructure, and end-to-end workloads, setting the standard for how the team operates
- Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks
- Profile and optimize end-to-end workload performance across compute, memory, networking, and communication layers using tools such as Nsight Systems, NCCL tests, and custom microbenchmarks
- Analyze scaling efficiency for distributed LLM workloads using data, tensor, pipeline, and expert parallelism across modern GPU clusters, and translate findings into concrete tuning guidance
- Own root-cause analysis of complex failures, including hangs, performance regressions, and topology sensitivity, in large distributed environments
- Define and build the resilience and failure-attribution stack: detecting, triaging, and attributing node, fabric, and workload failures across the cluster at scale
- Build repeatable benchmark suites, automation, acceptance criteria, and qualification workflows for new platforms
- Tune runtime settings, communication parameters, and deployment configurations in close partnership with framework, systems, and platform teams
- Deliver actionable, data-driven recommendations based on profiling, benchmark results, and cluster characterization
- Mentor engineers, drive technical standards, and act as a force multiplier across the broader performance and infrastructure organization