Andromeda is a company focused on making AI infrastructure accessible to early-stage startups. They are seeking a Site Reliability Engineer to provision and operate Kubernetes-based clusters, improve reliability and scalability, and collaborate with teams to deliver infrastructure for new services.

Responsibilities:

Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers
Build automation and tooling to streamline cluster deployments and integrations
Debug customer issues across networking, storage, scheduling, and system layers
Improve reliability and scalability of both training and inference infrastructure
Design and implement monitoring, alerting, and observability for critical systems
Collaborate with engineering and product teams to plan and deliver infrastructure for new services
Participate in on-call and incident response, leading postmortems and reliability improvements

Requirements:

5+ years of experience in SRE, DevOps, or infrastructure engineering roles
Strong Linux systems and networking fundamentals
Deep experience with Kubernetes and container orchestration at scale
Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.)
Strong automation and scripting skills (Python, Go, or Bash)
Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.)
Track record of operating production systems and leading incident response
Exposure to ML/AI infrastructure or GPU-based systems (CUDA, Slurm, Triton, etc.)
Familiarity with high-performance networking (InfiniBand, NVLink) or distributed storage (VAST, Weka, Ceph)
Customer-facing support or consulting experience

Site Reliability Engineer - AI Infrastructure
