NVIDIA is the platform upon which every new AI‑powered application is built. The company is seeking a Senior Software Engineer – Inference Platform Infrastructure to help build and automate the foundations that keep NVIDIA’s inference services running smoothly and reliably across thousands of GPUs.
Responsibilities:
- Build automation that makes inference at scale easy to operate: provisioning, configuration, upgrades, rollbacks, and routine maintenance—optimized for repeatability and safety
- Create and evolve deployment patterns for inference workloads on Kubernetes: rollouts, autoscaling, multi‑cluster patterns, GPU scheduling/isolation, and safe upgrade strategies
- Own platform reliability outcomes through software: define and improve SLIs/SLOs, error budgets, alert quality, and automated remediation for common failure modes
- Own and operate a large fleet of NVIDIA GPU and datacenter hardware from pre-release to production
Requirements:
- Strong software engineering skills; ability to build platforms and systems that our teams rely on
- 5+ years building and operating production distributed systems with strong ownership and a track record of improving reliability and eliminating toil
- Proven expertise in cloud-native platforms: Kubernetes, containers, service networking, configuration management, and modern CI/CD
- Deep experience with infrastructure‑as‑code and automation-first operations (e.g., GitOps workflows, policy enforcement, fleet management patterns)
- Excellent communication and collaboration skills; ability to lead cross‑functional efforts and drive improvements to completion
- BS/MS in Computer Science, Computer Engineering, or a related field, or equivalent experience
Preferred qualifications:
- Direct experience operating inference serving at scale (Triton, TensorRT‑LLM, KServe/Ray Serve, etc.)
- Built scheduling, placement, or quota systems (priority queues, fairness, admission control, rate limiting) for Kubernetes
- Built fleet health systems: telemetry pipelines, automated quarantine/drain, and hardware/software failure triage automation