Build and maintain infrastructure-as-code (Terraform, Helm) for our AWS EKS and GCP GKE clusters, plus on-premises deployments (including Tanzu and air-gapped environments).
Own CI/CD pipelines (GitHub Actions, Bazel, ArgoCD) and drive GitOps adoption.
Deploy, scale, and optimize ML/NLP inference workloads (vLLM, PyTorch, GPU scheduling with various Kubernetes scalers).
Build and improve observability: Prometheus, Grafana, Datadog, and OpenTelemetry.
Collaborate with Field Engineering to support PoCs and platform deployments in customer cloud VPCs and on-prem environments.
Contribute to backend services (Java 21, Python, gRPC) and platform features.
Improve system reliability, scalability, and developer experience across the engineering org.
Requirements
2+ years in platform engineering, DevOps, SRE, or backend infrastructure roles.
Strong Kubernetes experience (deployment, debugging, scaling — not just kubectl apply).
Hands-on with infrastructure-as-code: Terraform, Helm, or Pulumi.
Experience with at least one major cloud provider (AWS preferred; GCP or Azure also valued).
Proficiency in one or more of: Go, Python, Java. Comfortable reading and contributing to backend codebases.
Working knowledge of CI/CD systems (GitHub Actions, Bazel, ArgoCD, or similar).
Solid fundamentals in Linux, networking, and distributed systems.
Tech Stack
AWS
Azure
Cloud
Distributed Systems
Google Cloud Platform
Grafana
gRPC
Java
Kubernetes
Linux
Prometheus
Python
PyTorch
Terraform
Go
Benefits
100% paid medical, dental, and vision coverage for employees.
Option of a Health Savings Account (HSA) or Flexible Spending Account (FSA).
Generous paid time off (PTO) plus paid sick time and holidays.
Professional development and training opportunities.
Virtual company happy hours, team-building activities, and more.