Build automation that makes inference at scale easy to operate: provisioning, configuration, upgrades, rollbacks, and routine maintenance, all optimized for repeatability and safety.
Create and evolve deployment patterns for inference workloads on Kubernetes: rollouts, autoscaling, multi‑cluster patterns, GPU scheduling/isolation, and safe upgrade strategies.
Own platform reliability outcomes through software: define and improve SLIs/SLOs, error budgets, alert quality, and automated remediation for common failure modes.
Own and operate a large fleet of NVIDIA GPU and datacenter hardware from pre-release to production.
Requirements
Strong software engineering skills; ability to build platforms and systems that our teams rely on.
5+ years building and operating production distributed systems with strong ownership and a track record of improving reliability and eliminating toil.
Proven expertise in cloud-native platforms: Kubernetes, containers, service networking, configuration management, and modern CI/CD.
Deep experience with infrastructure-as-code and automation-first operations (e.g., GitOps workflows, policy enforcement, fleet management patterns).
Excellent communication and collaboration skills; ability to lead cross‑functional efforts and drive improvements to completion.
BS/MS in Computer Science, Computer Engineering, or a related field, or equivalent experience.