NVIDIA is seeking a Senior DL Algorithms Engineer to optimize and deploy deep learning models for efficient inference across diverse GPU platforms. The role involves collaborating with research scientists and engineers to take cutting-edge AI models from prototype to production.
Responsibilities:
- Optimize deep learning models for low-latency, high-throughput inference, with a focus on LLMs, VLMs, diffusion models, and World Foundation Models (WFMs) designed for physical AI applications
- Convert, deploy, and optimize models for efficient inference using frameworks such as TensorRT, TensorRT-LLM, vLLM, and SGLang (a deployment sketch follows this list)
- Analyze, profile, and optimize the performance of deep learning and physical AI workloads on state-of-the-art NVIDIA GPU hardware and software platforms
- Implement and refine components and algorithms for efficient serving of LLMs, VLMs, and WFMs at datacenter scale, leveraging technologies like Dynamo
- Collaborate with research scientists, software engineers, and hardware specialists to ensure seamless integration of cutting-edge AI models from training to deployment
- Contribute to the development of automation and tooling for NVIDIA Inference Microservices (NIMs) and inference optimization, including creating automated benchmarks to track performance regressions (a minimal benchmark sketch also follows this list)
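As a concrete illustration of the deployment work above, here is a minimal sketch of offline batch inference with vLLM. The model name, prompt, and sampling settings are placeholder assumptions, not details from this posting.

```python
# Minimal vLLM offline-inference sketch (assumptions: vLLM is installed;
# the model name below is a placeholder for whichever checkpoint is deployed).
from vllm import LLM, SamplingParams

prompts = ["Summarize the benefits of KV caching in one sentence."]
sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=64)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```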
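And a minimal sketch of the kind of automated latency benchmark the responsibilities mention; the measured function and the regression threshold are illustrative assumptions only.

```python
# Minimal latency-regression benchmark sketch (assumptions: run_inference
# is a stand-in for the real model call; the 50 ms threshold is illustrative).
import time

def run_inference():
    time.sleep(0.01)  # stand-in for a real model call

def p95_latency_ms(fn, iters=100):
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    return samples[int(0.95 * (len(samples) - 1))]

if __name__ == "__main__":
    latency = p95_latency_ms(run_inference)
    assert latency < 50.0, f"p95 latency regressed: {latency:.1f} ms"
    print(f"p95 latency: {latency:.1f} ms")
```

A CI job running a script like this after each change is one simple way to catch performance regressions before they reach production.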
Requirements:
- Master's or PhD in Computer Science, Electrical Engineering, Computer Engineering, or a related field (or equivalent experience)
- 3+ years of professional experience in deep learning, applied machine learning, or physical AI development
- Strong foundation in deep learning algorithms, including hands-on experience with LLMs, VLMs, and multimodal generative models such as World Foundation Models
- Deep understanding of transformer architectures, attention mechanisms, and inference bottlenecks
- Proficiency in building, optimizing, and deploying models using PyTorch or TensorFlow in production-grade environments
- Solid programming skills in Python and C++
- Experience with model quantization and modern inference optimization techniques (e.g., KV cache, in-flight batching, parallelization mapping; a KV-cache sketch follows this list)
- Strong fundamentals in GPU performance analysis and profiling, using tools such as Nsight Systems (nsys)
- Familiarity with serving models using Triton Inference Server and PyTriton via Docker (a PyTriton sketch follows this list)
- Proven experience deploying LLMs, VLMs, diffusion models, or World Foundation Models (WFMs) at scale in real-world applications, especially for robotics or autonomous vehicles
- Hands-on experience with model optimization and serving frameworks such as TensorRT, TensorRT-LLM, vLLM, SGLang, and ONNX
- Direct experience with NVIDIA Cosmos, Isaac Sim, Isaac Lab, or Omniverse platforms for synthetic data generation and physical AI simulation
- Experience with data curation pipelines and tools like NVIDIA NeMo Curator for large-scale video data processing and model post-training
- Deep understanding of distributed systems for large-scale model inference and serving
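To ground the KV-cache requirement above, here is a minimal PyTorch sketch of cached single-head attention during autoregressive decoding. Shapes, names, and the lack of masking or batching are simplifying assumptions for illustration.

```python
# Minimal KV-cache sketch for autoregressive decoding (assumptions:
# single head, no masking or batching; names are illustrative).
import torch

def decode_step(q_new, k_new, v_new, k_cache, v_cache):
    # Append this step's key/value to the cache instead of recomputing
    # projections for every previous token.
    k = torch.cat([k_cache, k_new], dim=0)  # (seq, d)
    v = torch.cat([v_cache, v_new], dim=0)
    scores = (q_new @ k.T) / k.shape[-1] ** 0.5  # (1, seq)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, k, v  # attended output plus updated cache

d = 64
k_cache = torch.zeros(0, d)
v_cache = torch.zeros(0, d)
for _ in range(4):  # four decode steps, each reusing the growing cache
    q = torch.randn(1, d)
    out, k_cache, v_cache = decode_step(
        q, torch.randn(1, d), torch.randn(1, d), k_cache, v_cache
    )
```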
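And a minimal sketch of binding a Python inference function with PyTriton, as in the Triton Inference Server requirement above. The model name and the trivial doubling function are placeholder assumptions standing in for a real model.

```python
# Minimal PyTriton serving sketch (assumptions: pytriton is installed;
# "toy_model" and the doubling infer function are placeholders).
import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import Tensor
from pytriton.triton import Triton

@batch
def infer_fn(input):
    # Stand-in for real model inference on a batched input.
    return {"output": input * 2.0}

with Triton() as triton:
    triton.bind(
        model_name="toy_model",
        infer_func=infer_fn,
        inputs=[Tensor(name="input", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="output", dtype=np.float32, shape=(-1,))],
    )
    triton.serve()  # blocks, exposing Triton's HTTP/gRPC endpoints
```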