SRM Digital LLC is seeking a highly skilled Site Reliability Engineer (SRE) with deep expertise in Microsoft Azure and AKS to drive reliability, scalability, and operational excellence across enterprise-scale platforms. The role focuses on designing resilient cloud architectures, implementing observability frameworks, and ensuring high availability of mission-critical services.
Responsibilities:
- Define and enforce SLOs, SLIs, and Error Budgets across Tier-0/Tier-1 services
- Lead architecture reviews to ensure high availability and RTO/RPO compliance
- Implement chaos engineering practices using Azure Chaos Studio
- Design and validate Disaster Recovery (DR) strategies and conduct regular DR drills
- Act as Incident Commander for critical incidents (P1/P2)
- Manage end-to-end incident lifecycle and lead Post-Incident Reviews (PIRs)
- Support global on-call rotations with strict response SLAs
- Promote a blameless post-mortem culture and drive remediation actions
- Design and manage enterprise observability platforms using Azure Monitor, Log Analytics, Application Insights, and Grafana
- Build alerting and incident integrations with PagerDuty and ServiceNow
- Develop automation, runbooks, and self-healing solutions using Azure Automation, Logic Apps, and scripting
- Embed reliability practices into Azure DevOps pipelines (performance testing, canary/blue-green deployments)
- Manage and optimize AKS clusters across regions, including scaling, upgrades, and security hardening
- Enforce infrastructure standards via Terraform/Bicep code reviews
Requirements:
- 7+ years of experience in SRE, Platform Engineering, or Cloud Infrastructure within large-scale enterprise environments
- 4+ years of hands-on experience with Microsoft Azure in a primary engineering role
- Expert-level experience with Azure Kubernetes Service (AKS), including cluster lifecycle, RBAC, network policies, autoscaling, and security standards
- Strong proficiency in Infrastructure as Code (Terraform required; Bicep preferred) and Azure Landing Zone architectures
- Programming/scripting expertise in Python (preferred), Go, or PowerShell
- Hands-on experience with Azure Monitor, Log Analytics, and Application Insights for enterprise observability
- Proven experience defining and managing SLOs, SLIs, and Error Budgets
- Strong understanding of Azure networking (Hub-Spoke, Virtual WAN, ExpressRoute, Azure Firewall, NSGs, Private Endpoints, DNS)
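- Microsoft Azure certifications: AZ-104 (Azure Administrator), AZ-305 (Azure Solutions Architect), AZ-400 (DevOps Engineer)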
- Certified Kubernetes Administrator (CKA)
- ITIL 4 Foundation