Senior/Principal Site Reliability Engineer at DataCrunch | JobVerse
Senior/Principal Site Reliability Engineer
DataCrunch
Remote
Germany
Full Time
3 hours ago
No Sponsorship
Key skills
Ansible
AWS
Azure
Cloud
Distributed Systems
DNS
Google Cloud Platform
Linux
Python
Terraform
Go
Bash
ML
GCP
Google Cloud
Lambda
CI/CD
About this role
Role Overview
Ensure the reliability, scalability, and performance of HPC and cloud systems
Build and maintain automation, observability, and monitoring frameworks for compute clusters
Collaborate with ML, data, and infrastructure teams to deliver high-availability systems
Develop and enhance CI/CD pipelines, deployment workflows, and on-call processes
Participate in architecture design and long-term infrastructure strategy discussions
Join a 24/7 on-call rotation, covering at least one full on-call week per month
Requirements
7+ years in SRE, DevOps, or Infrastructure Engineering, preferably in HPC or large-scale distributed systems
Linux expertise (Ubuntu or Debian preferred)
Strong experience with scripting and automation (Python, Go, Bash)
Proven ability with cloud platforms (AWS, GCP, Azure, or modern HPC providers such as CoreWeave, Lambda, Nebius)
Deep understanding of networking (DNS/TCP) and infrastructure-as-code tools (Terraform, Ansible)
Experience managing Slurm-based HPC GPU clusters, diagnosing performance issues, and designing efficient HPC jobs
Tech Stack
Ansible
AWS
Azure
Cloud
Distributed Systems
DNS
Google Cloud Platform
Linux
Python
Terraform
Go
Benefits
Generous cash + equity compensation
Various fringe benefits (e.g., healthcare, lunch, wellbeing)