Red Hat is seeking a Senior Site Reliability Engineer (SRE) to develop, scale, and operate its OpenShift managed cloud services. The role involves helping run OpenShift at scale, enabling customer self-service, and automating processes to improve efficiency.
Responsibilities:
- Contribute code to increase the scalability and reliability of the service
- Contribute software tests and participate in peer review to increase the quality of our codebase
- Help develop peers’ capabilities through knowledge sharing, mentoring, and collaboration
- Participate in a regular on-call schedule, including occasional paid weekends and holidays
- Practice sustainable incident response and blameless postmortems
- Resolve customer issues escalated from the Red Hat Global Support team
- Work within a small agile team to develop and improve SRE software, support your peers, plan, and self-improve
- Collaborate with cross-functional teams to identify opportunities for AI integration within the software development lifecycle, driving continuous improvement and innovation in engineering practices; share use cases for successful experiments with stakeholders for broader use
Requirements:
- A bachelor's degree in Computer Science or a related technical field involving software or systems engineering is required
- You must have some experience programming in Python AND Golang
- You must have experience working with public clouds such as AWS, GCP, or Azure
- You must also have the ability to collaboratively troubleshoot and solve problems in a team setting
- Direct experience with Kubernetes or OpenShift is a MUST
- Some experience troubleshooting an as-a-service offering (SaaS, PaaS, etc.) and some experience working with complex distributed systems
- Demonstrated ability to debug and optimize code and to automate routine tasks
- A solid understanding of Unix/Linux operating systems
- 5+ years of experience managing Linux servers running Red Hat Enterprise Linux (RHEL), CentOS, or Fedora hosted at a cloud provider such as Amazon Web Services (AWS), Google Compute Engine (GCE), or Microsoft Azure
- 1+ year(s) of experience with Kubernetes is a MUST
- 3+ years of experience with enterprise systems monitoring; knowledge of Prometheus is a plus
- 3+ years of experience with enterprise configuration management software like Ansible by Red Hat, Puppet, or Chef
- 2+ years of experience programming with at least one object-oriented language; experience with both Golang AND Python is required
- 2+ years of experience delivering a hosted service
- Demonstrated ability to quickly and accurately troubleshoot system issues
- Solid understanding of standard TCP/IP networking and common protocols like DNS and HTTP
- Solid communication skills and experience working directly with and presenting to customers
- 1+ year(s) of experience with Docker-based containers is a plus