Implement and mature SRE principles across operations: SLO/SLA definition and tracking, error budgets, and reliability engineering practices
Drive an automation-first approach to reduce manual intervention
Perform root cause analysis (RCA) and continuous improvement initiatives
Reduce incident volume and MTTR through engineering-driven solutions
Collaborate in blameless postmortems and reliability reviews
Requirements
Strong experience in Azure cloud operations and engineering, including management of IaaS/PaaS services in enterprise environments
Proven background in hybrid infrastructure environments (on-prem + Azure) with a solid understanding of networking, identity, and security
Hands-on expertise in Infrastructure as Code (IaC) using Bicep (or ARM/Terraform), with a focus on standardization, automation, and governance
Experience designing and implementing CI/CD pipelines using Azure DevOps, including automation of infrastructure and application deployments
Solid knowledge of cloud observability and monitoring practices, using tools such as Azure Monitor, Log Analytics, Application Insights, and/or Grafana
Experience applying Site Reliability Engineering (SRE) principles, including SLO/SLA management, incident reduction, and automation-driven operations
Strong troubleshooting and problem-solving skills across cloud and platform services, with the ability to manage complex production environments
Familiarity with ITIL-based operational processes (Incident, Problem, Change), ensuring alignment between operations and engineering
Experience driving cloud optimization initiatives, including performance, scalability, and cost (FinOps awareness)
Strong collaboration and communication skills, with the ability to work across engineering, operations, and customer stakeholders