Andromeda is a company focused on making AI infrastructure accessible to early-stage startups. They are seeking a Site Reliability Engineer to provision and operate Kubernetes-based clusters, improve the reliability and scalability of training and inference infrastructure, and collaborate with engineering and product teams to deliver infrastructure for new services.
Responsibilities:
- Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers
- Build automation and tooling to streamline cluster deployments and integrations
- Debug customer issues across networking, storage, scheduling, and system layers
- Improve reliability and scalability of both training and inference infrastructure
- Design and implement monitoring, alerting, and observability for critical systems
- Collaborate with engineering and product teams to plan and deliver infrastructure for new services
- Participate in on-call rotations and incident response, and lead postmortems and reliability improvements
Requirements:
- 5+ years of experience in SRE, DevOps, or infrastructure engineering roles
- Strong Linux systems and networking fundamentals
- Deep experience with Kubernetes and container orchestration at scale
- Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.)
- Strong automation and scripting skills (Python, Go, or Bash)
- Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.)
- Track record of operating production systems and leading incident response
- Exposure to ML/AI infrastructure or GPU-based systems (CUDA, Slurm, Triton, etc.)
- Familiarity with high-performance networking (InfiniBand, NVLink) or distributed storage (VAST, Weka, Ceph)
- Customer-facing support or consulting experience