NVIDIA is the platform for every new AI-powered application, and we are seeking a Principal Software Engineer - AI Inference to advance open-source LLM serving. In this role, you will contribute to upstream inference engines, optimize performance on NVIDIA GPUs and systems, and collaborate with teams across the company to strengthen the inference ecosystem.
Responsibilities:
- Drive upstream-first engineering in vLLM/SGLang: author and land PRs, engage in development discussions, help shape roadmaps, and build durable maintainer relationships
- Design and implement inference-runtime features that improve efficiency, latency, and tail behavior: request scheduling, batching policies, KV-cache management (paging/sharding), memory planning, and streaming
- Optimize core hot paths across the stack—from Python orchestration down to C++/CUDA kernels—using profiling and measurement to guide decisions
- Improve multi-GPU and multi-node inference: communication patterns, parallelism strategies (tensor/sequence/pipeline), and system-level scaling/efficiency
- Strengthen correctness, robustness, and operability: determinism where needed, graceful degradation, backpressure, observability hooks, and performance regression testing
- Collaborate across NVIDIA to integrate upstream advances with production needs (deployment patterns, compatibility, security posture) while keeping changes broadly adoptable by the community
- Mentor senior engineers, raise the technical bar through design and code reviews, and establish guidelines for performance engineering and upstream contribution workflows
Requirements:
- 15+ years building production software with significant depth in systems engineering; strong track record of owning ambiguous, high-impact technical problems end-to-end
- Demonstrated expertise in LLM inference/serving systems (e.g., vLLM, SGLang) and the tradeoffs that drive real production performance
- Strong programming skills in Rust, C++, Python, CUDA; ability to read, modify, and optimize performance-critical code across layers
- Experience with GPU performance analysis tools and methodologies (profiling, microbenchmarking, memory/comms analysis) and a strong measurement culture
- Solid foundation in distributed systems and concurrency: queues/schedulers, RPC/streaming, multi-process/multi-threaded runtime behavior, and scaling patterns across nodes
- Excellent communication skills; ability to influence across teams and represent NVIDIA well in open-source technical forums
- BS/MS in Computer Science, Computer Engineering, or related field (or equivalent experience)
- Substantial open-source contributions to vLLM, SGLang, PyTorch, Triton, NCCL, or related GPU/inference infrastructure; prior maintainer experience is a plus
- Shipped performance features such as paged attention/KV paging, speculative decoding, advanced scheduling, quantization-aware serving, or low-latency streaming optimizations
- Experience optimizing inference across the full stack: tokenizer and Python runtime overheads, kernel fusion, memory bandwidth, PCIe/NVLink effects, and network fabrics (e.g., InfiniBand)
- Built robust benchmarking and regression infrastructure for latency and efficiency, including dataset selection, load modeling, and reproducible performance tracking