Red Hat is the world’s leading provider of enterprise open source software solutions, using a community-powered approach to deliver high-performing Linux, cloud, container, and Kubernetes technologies. The Senior Site Reliability Engineer will drive service performance, tackle complex technical challenges, and enhance intelligent orchestration and self-healing capabilities across the platform, while mentoring junior engineers.

Responsibilities:

Design, write, and maintain software (primarily in Python and Golang) that automates the deployment, monitoring, and maintenance of Red Hat managed services
Onboard new services onto our OpenShift-based platform
Adhere to cloud-native design principles and best practices to ensure reliability, scalability, and security
Contribute to documents, like standard operating procedures (SOPs) and playbooks, that assist in issue resolution and new-service onboarding
Proactively use AI-assisted development tools (e.g., GitHub Copilot, Cursor, Claude Code) for code generation, auto-completion, and intelligent suggestions to accelerate development cycles and enhance code quality
Participate in an Agile Scrum team that scopes, prioritizes, and allocates work items
Participate in an on-call rotation that is responsible for responding to service incidents

Requirements:

3+ years of relevant work experience
Background writing object-oriented automation software in Python; experience with Golang is a plus
Background administering production cloud-native services, preferably containerized and deployed via a container-orchestration system like Kubernetes or OpenShift
Experience diagnosing service failures and carrying out incident response procedures
Familiarity with the Linux operating system and its configuration
Ability to effectively work in a globally distributed team
Understanding of computer networking and protocols, including TCP/IP and DNS
Understanding of computer security and cryptography basics, including certificates, TLS, and credential-storage systems like Vault, is a plus
Familiarity with CI/CD pipeline concepts and systems, like Jenkins and Tekton/Argo, is a plus
Familiarity with observability tools like Prometheus and Grafana, and with defining metrics that measure service health and reliability, is a plus
