NVIDIA's accelerated computing platform underpins a large share of new AI‑powered applications, and the company is seeking a Senior Software Engineer – AI Inference to advance open‑source LLM serving. The role involves contributing to upstream inference engines and improving the underlying stack for high‑throughput, low‑latency inference at scale.
Responsibilities:
- Contribute features, fixes, and optimizations upstream to vLLM/SGLang: author PRs, participate in reviews, write benchmarks/tests, and help drive designs to completion
- Implement and optimize inference‑runtime capabilities: batching and scheduling policies, streaming, request lifecycle management, and KV‑cache efficiency (paging/sharding) to improve throughput and tail latency; a minimal paging sketch follows this list
- Profile and improve hot paths across the stack, from Python orchestration to C++/CUDA kernels, using data to guide optimization work
- Improve multi‑GPU inference performance and reliability: parallelism strategies, communication patterns, and resource utilization across NVIDIA platforms
- Build and maintain performance and correctness regression tests to prevent slowdowns and ensure stable behavior across model and hardware configurations
- Collaborate with model, platform, and SRE teams to translate production requirements into upstreamable solutions with strong operability and maintainability
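For context on the KV‑cache item above, here is a minimal sketch of block‑based ("paged") KV‑cache allocation, the idea behind the paging work this role touches. The names (`BlockAllocator`, `BLOCK_TOKENS`) are illustrative and do not correspond to vLLM or SGLang internals.

```python
# Illustrative sketch only: a toy paged KV-cache block allocator.
# BlockAllocator and BLOCK_TOKENS are hypothetical names, not engine internals.

BLOCK_TOKENS = 16  # tokens stored per KV block


class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.seq_blocks: dict[int, list[int]] = {}  # seq_id -> block ids

    def append_token(self, seq_id: int, seq_len: int) -> int:
        """Return the block id that will hold KV state for the next token.

        seq_len is the sequence length *before* the new token; a fresh block
        is allocated whenever the current block is full (or on the first token).
        """
        blocks = self.seq_blocks.setdefault(seq_id, [])
        if seq_len % BLOCK_TOKENS == 0:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or swap")
            blocks.append(self.free_blocks.pop())
        return blocks[-1]

    def free(self, seq_id: int) -> None:
        """Release all blocks when a request finishes."""
        self.free_blocks.extend(self.seq_blocks.pop(seq_id, []))
```

Paging keeps per‑request KV memory in fixed‑size units, which is what lets a scheduler pack many concurrent requests onto a GPU under continuous batching; production engines layer prefix sharing, copy‑on‑write, and preemption/swapping on top of this basic allocation scheme.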
Requirements:
- 5+ years building production software with solid systems engineering fundamentals and a track record of delivering performance or reliability improvements
- Experience with LLM inference/serving stacks (e.g., vLLM, SGLang) and an understanding of the tradeoffs that drive real production performance
- Strong programming skills in Python plus C++ and/or CUDA; ability to debug and optimize performance‑critical code
- Experience with profiling and performance investigation (microbenchmarks, flame graphs, GPU profiling) and a measurement‑driven mindset
- Familiarity with distributed systems concepts and concurrency (queues/schedulers, multi‑process/multi‑threading, scaling across GPUs/nodes)
- Strong communication skills and comfort working with open‑source communities (issues, PR discussions, code review)
- BS/MS in Computer Science, Computer Engineering, or related field (or equivalent experience)
Preferred qualifications:
- Open‑source contributions to vLLM, SGLang, PyTorch, Triton, NCCL, Dynamo, or adjacent serving/runtime projects
- Shipped performance work such as improved attention/KV cache efficiency, speculative decoding, scheduler improvements, quantization-aware serving, or streaming latency reductions
- Experience building reproducible benchmarking and performance regression infrastructure for latency/throughput (a minimal harness sketch follows this list)
- Systems performance background spanning memory bandwidth, kernel fusion, PCIe/NVLink effects, and network fabrics (e.g., InfiniBand)
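As an illustration of the benchmarking infrastructure mentioned above, here is a minimal sketch of a latency/throughput regression check. The `generate()` stub, the baseline file, and the 10% tolerance are assumptions made for the example, not details from this posting or any specific engine.

```python
# Illustrative sketch only: a tiny latency/throughput regression check.
# generate(), baseline.json, and the tolerance are hypothetical.
import json
import statistics
import time
from pathlib import Path


def generate(prompt: str) -> str:
    # Stand-in for a call to the serving engine under test.
    time.sleep(0.01)
    return prompt[::-1]


def run_benchmark(prompts: list[str]) -> dict[str, float]:
    """Time each request and report latency percentiles plus request throughput."""
    latencies = []
    start = time.perf_counter()
    for p in prompts:
        t0 = time.perf_counter()
        generate(p)
        latencies.append(time.perf_counter() - t0)
    wall = time.perf_counter() - start
    latencies.sort()
    return {
        "p50_s": statistics.median(latencies),
        "p99_s": latencies[int(0.99 * (len(latencies) - 1))],
        "req_per_s": len(prompts) / wall,
    }


def check_regression(result: dict[str, float], baseline_path: Path, tol: float = 0.10) -> None:
    """Fail if latency percentiles exceed the stored baseline by more than tol."""
    baseline = json.loads(baseline_path.read_text())
    for key in ("p50_s", "p99_s"):
        if result[key] > baseline[key] * (1 + tol):
            raise SystemExit(f"regression: {key} {result[key]:.4f}s vs baseline {baseline[key]:.4f}s")
```

A real harness would drive an actual serving endpoint with controlled prompt/output lengths and concurrency, record tokens/sec as well as request latency, and keep per‑configuration baselines so regressions are caught per model and hardware combination.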