Role Overview

Architect, design and lead the implementation of the RHIVOS product SRE initiative.
Instrument metrics to support Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs) for critical services.
Use the metrics designed and built into the software to analyze system performance, identify performance bottlenecks and underutilized hardware, and scale the infrastructure design.
Review team contributions to the software, correct errors, and provide constructive feedback.
Lead and participate in incident response and postmortems, and help identify steps to minimize Mean Time To Resolution (MTTR).
Regularly contribute to internal workshops and training to upskill the team as the product architecture evolves.
Configure and maintain software production infrastructure and tooling.
Serve as an internal expert on infrastructure and tooling, including software production pipelines, providing guidance to engineering teams and making high-level recommendations to improve efficiency, reliability, and stability.
Create and maintain service monitoring, improve automation, uphold security best practices, and respond to service situations affecting the software production infrastructure.
Resolve service incidents using existing operating procedures, investigate outage causes, and coordinate incident resolution across service teams.
Act as a leader and mentor to less experienced colleagues, propose and drive continuous improvement ideas, and help the team benefit from evolving technology, such as AI tools.
Collaborate on incident retrospective reviews and the implementation of corrective items.
Proactively identify and eliminate toil by automating manual, repetitive, and error-prone processes.
Coordinate your actions with other Red Hat teams such as IT and Product Security to ensure our infrastructure meets quality expectations.
Implement monitoring, alerting and escalation plans in the event of an infrastructure outage or performance problem.
Work with service owners to co-define and implement SLIs and SLOs for the services you’ll support, ensure they are met, and execute remediation plans if they are not.
Help out and back up the RHIVOS Raleigh lab SRE when needed.

Requirements

8+ years of software reliability engineering experience with deep expertise in Linux systems, infrastructure-as-code, and complex, distributed enterprise environments.
Linux administration expertise
Advanced experience with Kubernetes/OpenShift administration and application development
Advanced experience with automation tools like Ansible or Terraform
Advanced experience with CI/CD platforms like GitLab CI, Tekton, and Pipelines as Code (optionally GitHub Actions, etc.)
Advanced experience with monitoring platforms and technologies
Advanced experience with AWS technologies
Experience with open source monitoring technologies (Grafana, Prometheus, OpenTelemetry)
Excellent written and verbal communication skills in English, as you'll be working in a globally distributed team
Proven track record of leading, and implementing hands-on, the program- or product-wide adoption of a data-driven reliability framework: architecting complex, multi-service SLO/SLI standards and institutionalizing error budget policies that balance rapid feature velocity with global system stability
Previous experience with the Site Reliability Engineering (SRE) model and software development using Python or Go
Ability to work in the Raleigh office when needed

Tech Stack

Ansible
AWS
Grafana
Kubernetes
Linux
Open Source
OpenShift
Prometheus
Python
Terraform

Benefits

Comprehensive medical, dental, and vision coverage
Flexible Spending Account: healthcare and dependent care
Health Savings Account: high deductible medical plan
Retirement 401(k) with employer match
Paid time off and holidays
Paid parental leave plans for all new parents
Leave benefits including disability, paid family medical leave, and paid military leave
Additional benefits including employee stock purchase plan, family planning reimbursement, tuition reimbursement, transportation expense account, employee assistance program, and more!

Principal Site Reliability Engineer – Automotive
