Design and operate ML infrastructure: Manage data, training, serving, and inference systems for high-throughput model workflows
Build scalable pipelines: Implement reproducible training and evaluation pipelines with versioning, scheduling, and artifact tracking
Optimize compute and cost: Tune GPU and CPU workloads, manage clusters, and drive efficiency via rightsizing, spot-instance scheduling, and caching
Serve models in production: Operate low-latency inference APIs with autoscaling, blue-green or canary rollouts, and safe rollbacks
Ensure reliability and observability: Define and own SLOs; instrument pipelines and services to track latency, cost, drift, and data quality
Secure and automate: Manage IAM, secrets, and container security; automate deployment pipelines via CI/CD and infrastructure as code
Collaborate cross-functionally: Partner with research scientists and AI engineers to deliver models from experiment to production with minimal friction
Document and enable: Build templates, runbooks, and internal tooling that make ML workflows repeatable, safe, and fast
Requirements
4+ years of experience in ML platform, DevOps, or infrastructure engineering
Deep knowledge of Kubernetes, CI/CD, containers, and cloud infrastructure (AWS, GCP, or Azure)
Hands-on experience managing GPU clusters and training/inference pipelines
Familiarity with data orchestration and with storage formats and processing engines (Delta Lake, Parquet, Spark, Polars)
Proven ability to ship and operate production ML systems with SLOs
Strong Python skills and comfort with infrastructure as code and automation
Experience with observability and cost optimization at scale