About this role

EnIn Systems is seeking a Site Reliability Engineer Senior Leader to enhance system uptime, performance, and scalability through a combination of software engineering and operational skills. The role involves leading teams, defining SLIs/SLOs, automating infrastructure, and managing incidents to ensure optimal reliability and cost efficiency in cloud operations.

Responsibilities:

Leadership & Mentoring: Lead a team of SREs, manage sprint planning, and foster career growth
System Reliability & Strategy: Own the uptime, performance, and capacity planning of production systems
Automation & Tools: Reduce manual work (toil) by building automation, managing infrastructure as code (Terraform, Kubernetes), and enhancing observability
Incident Management: Lead incident response, drive root cause analysis (RCA), and implement post-mortem action items
SLI/SLO Management: Define Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to balance velocity and reliability

Requirements:

Proficiency in coding/scripting (e.g., Python, Go) and familiarity with CI/CD tools
Strong knowledge of cloud platforms (AWS, GCP, Azure), Linux, networking, and containerization (Kubernetes)
Proven experience leading technical teams and managing complex projects
Ability to communicate technical SRE initiatives to stakeholders across the organization
