Own SRE solutions end‑to‑end, from design and implementation to operation and continuous improvement
Use IaC and config management to standardize and automate provisioning everywhere
Deliver solutions in a globally distributed, multi‑cloud hybrid environment
Design for failure with redundancy, failure domains, progressive delivery, and strict change control
Ensure the highest level of uptime and Quality of Service (QoS)
Conduct capacity management and planning to meet ongoing operational needs
Detect performance issues and recommend solutions
Collaborate with various teams in a fast‑paced environment
Participate in on-call rotations and incident reviews, assist in root cause identification, and produce high-quality RCA reports

B.S. degree in Computer Science or related technical field (or equivalent experience)
5+ years of professional experience building and supporting critical services
Experience supporting large-scale HPC clusters (Slurm or LSF) or Kubernetes clusters
Proficiency in modern CI/CD techniques and Infrastructure as Code (IaC)
Strong experience crafting large-scale infrastructure platforms
Proficiency in monitoring, metrics, container management, and log collection tools
5+ years of coding/scripting experience in at least two high-level programming languages such as Python, Go, Perl, or Ruby
Creative problem solver with excellent debugging skills and strong communication and documentation abilities

Senior Site Reliability Engineer – HPC

Key skills