Design and maintain a robust, Kubernetes-based AI platform that supports distributed training and high-throughput inference serving
Engineer low-latency serving solutions for LLMs and other models, optimizing engines (e.g., vLLM, TGI, Triton) to maximize throughput and minimize cost per token
Manage and scale GPU clusters in cloud (AWS) and on-prem environments, implementing efficient scheduling, auto-scaling, and spot-instance management to optimize costs
Build and maintain "Infrastructure as Code" (Terraform/Ansible) and CI/CD pipelines to automate infrastructure provisioning and the model deployment lifecycle
Implement comprehensive monitoring (Prometheus, Grafana) for GPU health, model latency, and system resource usage; lead incident response for critical AI infrastructure
Create tools and abstraction layers (SDKs, CLIs) that let data scientists self-serve compute resources without managing the underlying infrastructure
Ensure all AI infrastructure meets strict security standards, including encryption of sensitive data and effective access controls (IAM, VPCs)
Requirements
5+ years of experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering
Proven track record of managing large-scale production clusters (Kubernetes) and distributed systems
Deep understanding of GPU architectures (NVIDIA A100/H100), CUDA drivers, and networking requirements for distributed workloads
Experience deploying and scaling open-source LLMs and embedding models using containerized solutions
Strong belief in "Everything as Code": you automate toil wherever possible using Python, Go, or Bash