Ventures Unlimited Inc is looking for a Site Reliability Engineer to improve the reliability and performance of its Azure-hosted services. The role involves defining service level objectives, leading architectural reviews, and implementing observability and incident management practices.
Responsibilities:
- Define, own, and enforce enterprise-wide SLOs, SLIs, and Error Budgets across all Tier-0 and Tier-1 Azure-hosted services; report SLA compliance to executive stakeholders monthly
- Lead architectural reviews for new services and ensure reliability non-functionals (availability targets, RTO/RPO) are embedded from design through to production
- Champion and implement chaos engineering practices using Azure Chaos Studio and custom fault injection frameworks to proactively surface reliability risks
- Drive Disaster Recovery (DR) design and conduct quarterly DR drills across Azure paired regions
- Serve as Incident Commander for P1/P2 major incidents; own the end-to-end incident lifecycle from detection through resolution and Post-Incident Review (PIR)
- Participate in a structured On-Call rotation with follow-the-sun global coverage; maintain response SLAs of <5 minutes for Tier-0 services
- Drive a blameless post-mortem culture and ensure all action items from PIRs are tracked and delivered within the agreed SLA
- Design and operate the enterprise observability stack: Azure Monitor, Log Analytics Workspaces, Application Insights, and Azure Managed Grafana; ensure full MELT (Metrics, Events, Logs, Traces) coverage
- Build and maintain alerting frameworks using Azure Monitor Alert Rules and Azure Action Groups integrated with PagerDuty and ServiceNow
- Develop and operate platform automation, runbooks, and self-healing capabilities using Azure Automation, Logic Apps, and Python/PowerShell scripting
- Collaborate with DevOps and development teams to embed reliability gates into Azure DevOps pipelines, including automated performance testing, synthetic monitoring, and progressive deployment (canary/blue-green) strategies
- Manage the reliability of AKS clusters across multiple Azure regions; own node pool scaling, upgrade strategy, and cluster hardening in alignment with CIS Benchmarks
- Contribute to infrastructure-as-code reliability reviews using Terraform/Bicep to enforce standards across Azure Landing Zones
Requirements:
- 7+ years of experience in SRE, platform engineering, or cloud infrastructure engineering in large-scale enterprise environments (10,000+ employees or equivalent complexity)
- Deep, hands-on expertise with Microsoft Azure — minimum 4 years in a primary Azure cloud engineering role
- Expert-level proficiency with AKS: cluster lifecycle management, RBAC, network policies, pod security standards, cluster autoscaler, and Workload Identity
- Strong infrastructure-as-code skills: Terraform (required) and/or Bicep; experience managing Azure Landing Zones or Enterprise-Scale architecture
- Proficiency in at least one systems programming/scripting language: Python (preferred), Go, or PowerShell
- Experience designing and operating enterprise observability platforms at scale using Azure Monitor, Log Analytics, and Application Insights
- Demonstrable track record of owning SLOs/SLIs and delivering measurable reliability improvements in production
- Strong knowledge of enterprise networking in Azure: Hub-and-Spoke/Virtual WAN, ExpressRoute, Azure Firewall, NSGs, Private Endpoints, and Azure Private DNS zones
- AZ-104
- AZ-305
- AZ-400
- CKA
- ITIL v4 Foundation