Red Hat is the world’s leading provider of enterprise open source software solutions, using a community-powered approach to deliver high-performing Linux, cloud, container, and Kubernetes technologies. The Senior Site Reliability Engineer will drive service performance, tackle complex technical challenges, and enhance intelligent orchestration and self-healing capabilities across the platform, while mentoring junior engineers.
Responsibilities:
- Design, write, and maintain software (primarily in Python and Golang) that automates the deployment, monitoring, and maintenance of Red Hat managed services
- Onboard new services onto our OpenShift-based platform
- Adhere to cloud-native design principles and best practices to ensure reliability, scalability, and security
- Contribute to documentation, such as standard operating procedures (SOPs) and playbooks, that assists in issue resolution and new-service onboarding
- Proactively utilize AI-assisted development tools (e.g., GitHub Copilot, Cursor, Claude Code) for code generation, auto-completion, and intelligent suggestions to accelerate development cycles and enhance code quality
- Participate in an Agile Scrum team that scopes, prioritizes, and allocates work items
- Participate in an on-call rotation that is responsible for responding to service incidents
Requirements:
- 3+ years of relevant work experience
- Background writing object-oriented automation software in Python; experience with Golang is a plus
- Background administering production cloud-native services, preferably containerized and deployed via a container-orchestration system like Kubernetes or OpenShift
- Experience diagnosing service failures and carrying out incident response procedures
- Familiarity with the Linux operating system and its configuration
- Ability to effectively work in a globally distributed team
- Understanding of computer networking and protocols, including TCP/IP and DNS
- Understanding of computer security and cryptography basics, including certificates, TLS, and credential-storage systems like Vault, is a plus
- Familiarity with CI/CD pipeline concepts and systems, like Jenkins and Tekton/Argo, is a plus
- Familiarity with observability tools like Prometheus and Grafana, and with defining metrics that measure service health and reliability, is a plus