Deepgram is the leading platform underpinning the emerging trillion-dollar Voice AI economy, providing real-time APIs for speech-to-text (STT) and text-to-speech (TTS). We are seeking an experienced Site Reliability Engineer to build and operate the hybrid infrastructure foundation for advanced AI/ML research and product development.
Responsibilities:
- Architect and maintain our core computing platform using Kubernetes on AWS and on-premises, providing a stable, scalable environment for all applications and services
- Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated
- Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources
- Provision, manage, and maintain our on-premises bare-metal server infrastructure for high-performance GPU computing
- Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments
- Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning
- Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle
- Automate the lifecycle of single-tenant, managed deployments
Requirements:
- 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
- Proven, hands-on experience building and managing production infrastructure with Terraform
- Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment
- Experience with high-performance computing (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads
- Experience managing bare-metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management
- Strong scripting and automation skills (e.g., Python, Go, Bash)
- Experience with CI/CD systems (e.g., GitLab CI, Jenkins, ArgoCD) and building developer tooling
- Familiarity with FinOps principles and cloud cost optimization strategies
- Knowledge of Kubernetes networking (e.g., Calico, Cilium) and storage (e.g., Ceph, Rook) solutions
- Experience in a multi-region or hybrid cloud environment