EnIn Systems is seeking a Site Reliability Engineer Senior Leader to enhance system uptime, performance, and scalability through a combination of software engineering and operational skills. The role involves leading teams, defining SLIs/SLOs, automating infrastructure, and managing incidents to ensure optimal reliability and cost efficiency in cloud operations.
Responsibilities:
- Leadership & Mentoring: Lead a team of SREs, manage sprint planning, and support team members' career growth
- System Reliability & Strategy: Own the uptime, performance, and capacity planning of production systems
- Automation & Tools: Reduce manual work (toil) by building automation, managing infrastructure as code (Terraform, Kubernetes), and enhancing observability
- Incident Management: Lead incident response, drive root cause analysis (RCA), and follow post-mortem action items through to completion
- SLI/SLO Management: Define Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to balance release velocity and reliability
Requirements:
- Proficiency in coding/scripting (e.g., Python, Go) and familiarity with CI/CD tools
- Strong knowledge of cloud platforms (AWS, GCP, Azure), Linux, networking, and container orchestration (Kubernetes)
- Proven experience leading technical teams and managing complex projects
- Ability to communicate technical SRE initiatives to stakeholders across the organization