Senior Site Reliability Engineer – HPC at NVIDIA | JobVerse
Senior Site Reliability Engineer – HPC
NVIDIA
United States
Full Time
2 hours ago
$152,000 - $287,500 USD
H1B Sponsor
Key skills
Cloud
Kubernetes
Perl
Python
Ruby
Go
CI/CD
Communication
About this role
Role Overview
Own SRE solutions end‑to‑end, from design and implementation to operation and continuous improvement
Use Infrastructure as Code (IaC) and configuration management to standardize and automate provisioning across all environments
Deliver solutions in a globally distributed, multi‑cloud hybrid environment
Design for failure with redundancy, failure domains, progressive delivery, and strict change control
Ensure the highest level of uptime and Quality of Service (QoS)
Conduct capacity management and planning to meet ongoing operational needs
Detect performance issues and recommend solutions
Collaborate with various teams in a fast‑paced environment
Participate in on-call rotations and incident reviews, assist in root cause identification, and produce high-quality root cause analysis (RCA) reports
Requirements
B.S. degree in Computer Science or a related technical field (or equivalent experience)
5+ years professional experience building and supporting critical services
Experience supporting large-scale HPC clusters managed with Slurm or LSF, or large-scale Kubernetes clusters
Proficiency in modern CI/CD techniques and Infrastructure as Code (IaC)
Strong experience crafting large-scale infrastructure platforms
Proficient in monitoring, metrics, container management, and log collection tools
5+ years of coding/scripting experience in at least two high-level programming languages such as Python, Go, Perl, or Ruby
Creative problem solver with excellent debugging skills and strong communication and documentation abilities
Tech Stack
Cloud
Kubernetes
Perl
Python
Ruby
Go
Benefits
Eligible for equity
Comprehensive benefits package