Site Reliability Engineer at RunPod | JobVerse
Site Reliability Engineer
RunPod
Remote
United States
Full Time
2 weeks ago
$150,000 - $200,000 USD
No Visa Sponsorship
Key skills
Distributed Systems
Grafana
Linux
Prometheus
Python
Go
Bash
AI
CI/CD
Leadership
Communication
Remote Work
About this role
Role Overview
Increase platform uptime and reduce incident frequency and duration
Establish and operationalize SLIs/SLOs across services
Improve MTTR through better tooling, automation, and runbooks
Strengthen production readiness standards
Drive long-term systemic reliability improvements
Define and implement SLIs/SLOs for critical services
Lead incident response and coordinate cross-team mitigation efforts
Conduct blameless postmortems and ensure corrective actions are completed
Perform production readiness reviews for new services and features
Identify systemic risks and drive preventative improvements
Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)
Improve signal-to-noise ratio in alerts and reduce alert fatigue
Build internal tooling for reliability tracking and reporting
Improve visibility into GPU performance and distributed systems health
Automate recurring operational workflows
Build tools and scripts (Python, Go, Bash) to eliminate manual processes
Improve deployment safety through automation and guardrails
Strengthen CI/CD reliability and release processes
Partner with engineering teams to improve system resilience
Provide guidance on fault tolerance, scalability, and failure handling
Contribute to architectural discussions with a reliability-first mindset
Requirements
5+ years of experience in SRE, Reliability Engineering, or Production Engineering
Strong Linux systems and networking expertise
Experience managing containerized production systems
Strong understanding of distributed systems and failure modes
Experience defining and managing SLIs/SLOs
Proven incident response and postmortem leadership experience
Strong scripting or programming skills
Experience with monitoring and alerting systems
Excellent written communication skills
Successful completion of a background check
Tech Stack
Distributed Systems
Grafana
Linux
Prometheus
Python
Go
Benefits
Meaningful equity in a fast-growing company
Generous medical, dental & vision plans
Flexible PTO: take the time you need to recharge
Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication
Join a passionate team on the cutting edge of AI infrastructure — where culture, learning, and ownership are at the heart of how we scale.