NVIDIA is the platform upon which every new AI‑powered application is built. They are seeking a Senior Software Engineer – Inference Platform Infrastructure to help build and automate the foundations that keep NVIDIA’s inference services running smoothly and reliably across thousands of GPUs.

Responsibilities:

Build automation that makes inference at scale easy to operate: provisioning, configuration, upgrades, rollbacks, and routine maintenance—optimized for repeatability and safety
Create and evolve deployment patterns for inference workloads on Kubernetes: rollouts, autoscaling, multi‑cluster patterns, GPU scheduling/isolation, and safe upgrade strategies
Own platform reliability outcomes through software: define and improve SLIs/SLOs, error budgets, alert quality, and automated remediation for common failure modes
Own and operate a large fleet of NVIDIA GPU and datacenter hardware, from pre-release to production

Requirements:

Strong software engineering skills; ability to build platforms and systems that our teams rely on
5+ years building and operating production distributed systems with strong ownership and a track record of improving reliability and eliminating toil
Proven expertise in cloud-native platforms: Kubernetes, containers, service networking, configuration management, and modern CI/CD
Deep experience with infrastructure‑as‑code and automation-first operations (e.g., GitOps workflows, policy enforcement, fleet management patterns)
Excellent communication and collaboration skills; ability to lead cross‑functional efforts and drive improvements to completion
BS/MS in Computer Science, Computer Engineering, or a related field, or equivalent experience
Direct experience operating inference serving at scale (Triton, TensorRT‑LLM, KServe/Ray Serve, etc.)
Built scheduling, placement, or quota systems (priority queues, fairness, admission control, rate limiting) for Kubernetes
Built fleet health systems: telemetry pipelines, automated quarantine/drain, and hardware/software failure triage automation
