NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High Performance Computing and Visualization. They are seeking a motivated Deep Learning engineer to integrate advanced CUDA features and Distributed Runtime technologies into AI stacks, while collaborating with teams to enhance performance and programmability of AI applications.

Responsibilities:

Integrate new CUDA features and Runtime abstractions in AI frameworks: from PoC to performance analysis to production
Perform deep analysis of AI workloads and frameworks to identify requirements and opportunities to innovate in the lower layers of the stack. Collaborate hands-on with teams working on the latest AI models
Own and drive improvements in the AI Compiler-Runtime interface to build speed-of-light multi-GPU multi-node solutions
Design fault-tolerant and elastic solutions for large-scale or dynamic AI workloads
Influence the roadmap of core CUDA to facilitate building next-gen DL frameworks
Collaborate with a very dynamic team across multiple time zones
Collaborate closely with AI researchers, HW and SW architects, kernel and compiler authors and CUDA driver experts to co-design systems and frameworks that enhance performance and programmability
Develop exploratory tools and runtime systems to profile and accelerate new paradigms in deep learning
Write clean, effective, and maintainable code, ensuring exploratory prototypes can smoothly transition into open-source releases, upstream framework integrations, internal tools, or closed-source commercial products

Requirements:

BS, MS, or PhD degree in Computer Science, Computer Engineering, Electrical Engineering, or related field (or equivalent experience)
8+ years of relevant industry experience, or equivalent academic experience after completing the degree
Development experience with Deep Learning Frameworks such as PyTorch and JAX, and Inference Engines such as TRT-LLM, vLLM, and SGLang
Rapid prototyping and development with Python, C++, CUDA or related DSLs
Solid grasp of AI models, parallelisms, and/or compiler technologies (e.g. torch.compile)
Experience conducting performance benchmarking on AI clusters. Familiarity with at least one performance profiler toolchain (PyTorch profiler, NVIDIA Nsight Systems)
Understanding of HPC/AI communication concepts
Good understanding of computer system architecture, HW-SW interactions and operating systems principles (aka systems software fundamentals)
Adaptability and passion to learn new frameworks and tools
Flexibility to work and communicate effectively across different teams and timezones
Deep expertise in the performance internals and execution graphs of major deep learning autograd, training and inference frameworks (e.g., PyTorch, JAX, TensorRT, vLLM, SGLang, NeMo, Megatron, MaxText)
Hands-on experience with CUDA, specific communication libraries (e.g., NCCL, MPI, UCX) and distributed machine learning techniques (e.g., pipeline parallelism, tensor parallelism)
Expertise in one or more of these areas: Training, Distributed inference, MoE, Reinforcement Learning, kernel authoring (on CUDA, Triton, CuTe, etc.)
Background in deep learning compilers, both graph-level and codegen (e.g., Triton, XLA, torch.compile)
Experience programming for compute and communication overlap in distributed runtimes

Senior Deep Learning Frameworks CUDA Software Engineer
