Red Hat is the world’s leading provider of enterprise open source software solutions, and it is seeking a Senior Site Reliability Engineer. This role is responsible for driving the reliability, performance, and scalability of services, mentoring junior engineers, and leading incident response efforts.

Responsibilities:

Lead the development and implementation of robust code and automation scripts to improve service reliability and scalability
Conduct thorough code reviews and testing processes to ensure the highest quality standards in the codebase
Work to solve moderately complex issues, making decisions that impact the service's reliability and performance
Mentor and guide junior engineers, fostering a collaborative environment focused on continuous improvement
Engage in a regular on-call rotation, taking responsibility for critical incidents and ensuring timely resolution
Lead incident response and postmortem processes, implementing solutions to prevent recurrence of issues
Collaborate with cross-functional teams to design, develop, and refine SRE tools and systems that support service objectives
Take ownership of tasks and projects, prioritizing them according to their impact on service health and team goals

Requirements:

A bachelor's degree in Computer Science or a related technical field involving software or systems engineering is required
Hands-on experience that demonstrates your ability and interest in Site Reliability Engineering may be considered in lieu of degree requirements
Some experience programming in at least one of these languages: Python, Golang, C, C++ or another object-oriented language
Experience working with public clouds such as AWS, GCP, or Azure
Ability to collaboratively troubleshoot and solve problems in a team setting
Some experience troubleshooting an as-a-service offering (SaaS, PaaS, etc.)
Some experience working with complex distributed systems
Basic understanding of Unix/Linux operating systems
5+ years of experience managing Linux servers running Red Hat Enterprise Linux (RHEL), CentOS, or Fedora hosted at a cloud provider such as Amazon Web Services (AWS), Google Compute Engine (GCE), or Microsoft Azure
3+ years of experience with enterprise systems monitoring; knowledge of Prometheus is a plus
3+ years of experience with enterprise configuration management software like Ansible by Red Hat, Puppet, or Chef
2+ years of experience programming with at least one object-oriented language; Golang, Java, or Python are preferred
2+ years of experience delivering a hosted service
Demonstrated ability to quickly and accurately troubleshoot system issues
Solid understanding of standard TCP/IP networking and common protocols like DNS and HTTP
Solid communication skills and experience working directly with and presenting to customers
1+ year(s) of experience with Kubernetes is a plus
1+ year(s) of experience with Docker-based containers is a plus
