NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High Performance Computing and Visualization. They are seeking a motivated Deep Learning engineer to integrate advanced CUDA features and Distributed Runtime technologies into AI stacks, while collaborating with teams to enhance performance and programmability of AI applications.
Responsibilities:
- Integrate new CUDA features and Runtime abstractions in AI frameworks: from PoC to performance analysis to production
- Perform deep analysis of AI workloads and frameworks to identify requirements and opportunities to innovate in the lower layers of the stack. Collaborate hands-on with teams working on the latest AI models
- Own and drive improvements in the AI Compiler-Runtime interface to build speed-of-light multi-GPU multi-node solutions
- Design fault-tolerant and elastic solutions for large-scale or dynamic AI workloads
- Influence the roadmap of core CUDA to facilitate building next-gen DL frameworks
- Collaborate with a very dynamic team across multiple time zones
- Collaborate closely with AI researchers, HW and SW architects, kernel and compiler authors and CUDA driver experts to co-design systems and frameworks that enhance performance and programmability
- Develop exploratory tools and runtime systems to profile and accelerate new paradigms in deep learning
- Write clean, effective, and maintainable code, ensuring exploratory prototypes can smoothly transition into open-source releases, upstream framework integrations, internal tools, or closed-source commercial products
Requirements:
- BS, MS, or PhD degree in Computer Science, Computer Engineering, Electrical Engineering, or related field (or equivalent experience)
- 8+ years of relevant industry experience or equivalent academic experience after completing the degree
- Development experience with Deep Learning Frameworks such as PyTorch and JAX, and Inference Engines such as TRT-LLM, vLLM, and SGLang
- Rapid prototyping and development with Python, C++, CUDA or related DSLs
- Solid grasp of AI models, parallelisms, and/or compiler technologies (e.g. torch.compile)
- Experience conducting performance benchmarking on AI clusters. Familiarity with at least one performance profiler toolchain (e.g., PyTorch Profiler, NVIDIA Nsight Systems)
- Understanding of HPC/AI communication concepts
- Good understanding of computer system architecture, HW-SW interactions and operating systems principles (aka systems software fundamentals)
- Adaptability and passion to learn new frameworks and tools
- Flexibility to work and communicate effectively across different teams and timezones
- Deep expertise in the performance internals and execution graphs of major deep learning autograd, training and inference frameworks (e.g., PyTorch, JAX, TensorRT, vLLM, SGLang, NeMo, Megatron, MaxText)
- Hands-on experience with CUDA, specific communication libraries (e.g., NCCL, MPI, UCX) and distributed machine learning techniques (e.g., pipeline parallelism, tensor parallelism)
- Expertise in one or more of these areas: Training, Distributed inference, MoE, Reinforcement Learning, kernel authoring (e.g., in CUDA, Triton, CuTe)
- Background in deep learning compilers, both graph-level and codegen (e.g., Triton, XLA, torch.compile)
- Experience with programming for compute and communication overlap in distributed runtimes