NVIDIA is a leading technology company specializing in edge AI for automotive and robotics. The company is seeking a Senior Software Engineer to develop and optimize inference frameworks for large language models (LLMs) on embedded platforms, collaborating across teams to deliver high-performance solutions.
Responsibilities:
- Develop and evolve a state-of-the-art inference framework in modern C++ that extends TensorRT with autoregressive model serving capabilities, including speculative decoding, LoRA, MoE, and KV cache management (an illustrative KV cache sketch follows this list)
- Design and implement compiler and runtime optimizations tailored for transformer-based models running on constrained, real-time platforms
- Collaborate with teams across CUDA, kernel libraries, compilers, and robotics to deliver high-performance, production-ready solutions
- Contribute to CUDA kernel and operator development for critical transformer components such as attention, GEMM, and MoE (see the kernel sketch after this list)
- Benchmark, profile, and optimize inference performance across diverse embedded and automotive environments
- Stay ahead of the rapidly evolving LLM/VLM ecosystem and bring emerging techniques into product-grade software
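To make the serving-side responsibilities concrete, here is a minimal sketch of the per-sequence KV cache bookkeeping that an autoregressive decode loop relies on. All types and names are hypothetical illustrations, not NVIDIA or TensorRT APIs; a production cache would be paged, device-resident, and shared across batched requests.

    // Minimal sketch (hypothetical types and names, not NVIDIA or TensorRT APIs):
    // a fixed-capacity per-layer KV cache that an autoregressive decode loop
    // appends to at each step.
    #include <algorithm>
    #include <cstddef>
    #include <stdexcept>
    #include <vector>

    struct KVCache {
        std::size_t maxSeqLen, numHeads, headDim;
        std::size_t curLen = 0;               // tokens cached so far
        std::vector<float> keys, values;      // each sized maxSeqLen * numHeads * headDim

        KVCache(std::size_t maxLen, std::size_t heads, std::size_t dim)
            : maxSeqLen(maxLen), numHeads(heads), headDim(dim),
              keys(maxLen * heads * dim), values(maxLen * heads * dim) {}

        // Append one token's K/V projections; attention then reads tokens [0, curLen).
        void append(const float* k, const float* v) {
            if (curLen == maxSeqLen) throw std::runtime_error("KV cache full");
            const std::size_t stride = numHeads * headDim;
            std::copy(k, k + stride, keys.begin() + curLen * stride);
            std::copy(v, v + stride, values.begin() + curLen * stride);
            ++curLen;
        }

        // Speculative decoding: truncate entries for draft tokens the verifier rejected.
        void rollback(std::size_t acceptedLen) { curLen = acceptedLen; }
    };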
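On the kernel side, here is a deliberately naive CUDA example of a row-wise softmax, the normalization step inside attention. This is a sketch for orientation only, not a production kernel; real attention kernels fuse this step with the score GEMM and use warp-level reductions and online softmax rather than one thread per row.

    // Illustrative only: a naive row-wise softmax kernel of the kind that sits
    // inside attention (scores -> probabilities).
    #include <cstddef>
    #include <cuda_runtime.h>

    __global__ void rowSoftmax(const float* scores, float* probs, int rows, int cols) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= rows) return;
        const float* in = scores + static_cast<std::size_t>(row) * cols;
        float* out = probs + static_cast<std::size_t>(row) * cols;

        float m = in[0];                                  // running max, for numerical stability
        for (int j = 1; j < cols; ++j) m = fmaxf(m, in[j]);
        float sum = 0.0f;
        for (int j = 0; j < cols; ++j) { out[j] = __expf(in[j] - m); sum += out[j]; }
        for (int j = 0; j < cols; ++j) out[j] /= sum;
    }

    // Launch: one thread per row of a [rows x cols] score matrix.
    // rowSoftmax<<<(rows + 255) / 256, 256>>>(dScores, dProbs, rows, cols);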
Requirements:
- BS, MS, PhD, or equivalent experience in Computer Science, Electrical/Computer Engineering, or a closely related field
- 4+ years of relevant software development experience
- Deep understanding of transformer models and inference optimization techniques (e.g., quantization, tensor parallelism, or memory-efficient scheduling); a minimal quantization sketch appears at the end of this posting
- Proficiency in modern C++ (C++11/14/17 and beyond)
- Familiarity with popular LLM frameworks and libraries such as TensorRT, TensorRT-LLM, vLLM, SGLang, MLC-LLM, or FlashInfer
- A track record of strong software design, execution, and collaboration across disciplines
Ways to stand out:
- Demonstrated development experience or open-source contributions to LLM inference frameworks and libraries such as SGLang, vLLM, or FlashInfer
- Proficiency with CUDA, including efficient kernel development and performance profiling, and a solid grasp of GPU architecture fundamentals
- Prior work on autoregressive LLM serving systems, such as speculative decoding or KV cache management
- Familiarity with compiler infrastructure for large language model inference
- Exposure to robotics or embedded AI pipelines, including optimizing for low-latency, resource-constrained systems
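As a pointer to the quantization techniques mentioned under Requirements, here is a minimal sketch of symmetric per-tensor int8 weight quantization. The per-tensor scheme and all names are assumptions made for the sketch; deployed pipelines typically use per-channel scales and calibration tooling (for example, via TensorRT).

    // Illustrative only: symmetric per-tensor int8 weight quantization.
    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct QuantResult {
        std::vector<std::int8_t> q;   // quantized weights
        float scale;                  // dequantize with w[i] ~= q[i] * scale
    };

    QuantResult quantizeInt8(const std::vector<float>& w) {
        float maxAbs = 0.0f;
        for (float x : w) maxAbs = std::max(maxAbs, std::fabs(x));
        const float scale = maxAbs > 0.0f ? maxAbs / 127.0f : 1.0f;  // map [-maxAbs, maxAbs] -> [-127, 127]
        QuantResult r{std::vector<std::int8_t>(w.size()), scale};
        for (std::size_t i = 0; i < w.size(); ++i)
            r.q[i] = static_cast<std::int8_t>(
                std::lround(std::clamp(w[i] / scale, -127.0f, 127.0f)));
        return r;
    }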