DataCrunch is building a fully featured European AI cloud with a focus on renewable energy. The company is seeking a Senior or Principal Site Reliability Engineer to strengthen its HPC and cloud infrastructure, ensuring reliability and performance while working with ML, data, and infrastructure teams to improve automation and deployment workflows.
Responsibilities:
- Ensure the reliability, scalability, and performance of HPC and cloud systems
- Build and maintain automation, observability, and monitoring frameworks for compute clusters
- Collaborate with ML, data, and infrastructure teams to deliver high-availability systems
- Develop and enhance CI/CD pipelines, deployment workflows, and on-call processes
- Participate in architecture design and long-term infrastructure strategy discussions
- Participate in a 24/7 on-call rotation, with at least one full on-call week per month
Requirements:
- 7+ years in SRE, DevOps, or Infrastructure Engineering, preferably in HPC or large-scale distributed systems
- Linux expertise (Ubuntu or Debian preferred)
- Strong experience with scripting and automation (Python, Go, Bash)
- Proven ability with cloud platforms (AWS, GCP, Azure, or modern HPC providers such as CoreWeave, Lambda, Nebius)
- Deep understanding of networking (DNS/TCP) and infrastructure-as-code tools (Terraform, Ansible)
- Experience managing Slurm-based HPC GPU clusters, diagnosing performance issues, and designing efficient HPC jobs