Red Hat is the world’s leading provider of enterprise open source software solutions, and it is seeking a Senior Site Reliability Engineer. This role is responsible for driving the reliability, performance, and scalability of services, mentoring junior engineers, and leading incident response efforts.
Responsibilities:
- Lead the development and implementation of robust code and automation scripts to improve service reliability and scalability
- Conduct thorough code reviews and testing processes to ensure the highest quality standards in the codebase
- Solve moderately complex issues, making decisions that affect the service's reliability and performance
- Mentor and guide junior engineers, fostering a collaborative environment focused on continuous improvement
- Engage in a regular on-call rotation, taking responsibility for critical incidents and ensuring timely resolution
- Lead incident response and postmortem processes, implementing solutions to prevent recurrence of issues
- Collaborate with cross-functional teams to design, develop, and refine SRE tools and systems that support service objectives
- Take ownership of tasks and projects, prioritizing them according to their impact on service health and team goals
Requirements:
- A bachelor's degree in Computer Science or a related technical field involving software or systems engineering is required
- Hands-on experience that demonstrates your ability and interest in Site Reliability Engineering may be considered in lieu of degree requirements
- Some experience programming in at least one of these languages: Python, Golang, C, C++, or another object-oriented language
- Experience working with public clouds such as AWS, GCP, or Azure
- Ability to collaboratively troubleshoot and solve problems in a team setting
- Some experience troubleshooting an as-a-service offering (SaaS, PaaS, etc.)
- Some experience working with complex distributed systems
- Basic understanding of Unix/Linux operating systems
- 5+ years of experience managing Linux servers running Red Hat Enterprise Linux (RHEL), CentOS, or Fedora hosted at a cloud provider such as Amazon Web Services (AWS), Google Compute Engine (GCE), or Microsoft Azure
- 3+ years of experience with enterprise systems monitoring; knowledge of Prometheus is a plus
- 3+ years of experience with enterprise configuration management software like Ansible by Red Hat, Puppet, or Chef
- 2+ years of experience programming with at least one object-oriented language; Golang, Java, or Python are preferred
- 2+ years of experience delivering a hosted service
- Demonstrated ability to quickly and accurately troubleshoot system issues
- Solid understanding of standard TCP/IP networking and common protocols like DNS and HTTP
- Solid communication skills and experience working directly with and presenting to customers
- 1+ years of experience with Kubernetes is a plus
- 1+ years of experience with Docker-based containers is a plus