Ventures Unlimited Inc is looking for a Site Reliability Engineer to improve the reliability and performance of its Azure-hosted services. The role involves defining service level objectives, leading architectural reviews, and implementing observability and incident management practices.

Responsibilities:

Define, own, and enforce enterprise-wide SLOs, SLIs, and Error Budgets across all Tier-0 and Tier-1 Azure-hosted services; report SLA compliance to executive stakeholders monthly
Lead architectural reviews for new services and ensure reliability non-functionals (availability targets, RTO/RPO) are embedded from design through to production
Champion and implement chaos engineering practices using Azure Chaos Studio and custom fault injection frameworks to proactively surface reliability risks
Drive Disaster Recovery (DR) design and conduct quarterly DR drills across Azure paired regions
Serve as Incident Commander for P1/P2 major incidents and own the end-to-end incident lifecycle from detection through resolution and Post-Incident Review (PIR)
Participate in a structured On-Call rotation with follow-the-sun global coverage; maintain response SLAs of <5 minutes for Tier-0 services
Drive a blameless post-mortem culture and ensure all action items from PIRs are tracked and delivered within the agreed SLA
Design and operate the enterprise observability stack: Azure Monitor, Log Analytics Workspaces, Application Insights, and Azure Managed Grafana; ensure full MELT (Metrics, Events, Logs, Traces) coverage
Build and maintain alerting frameworks using Azure Monitor Alert Rules and Azure Action Groups integrated with PagerDuty and ServiceNow
Develop and operate platform automation, runbooks, and self-healing capabilities using Azure Automation, Logic Apps, and Python/PowerShell scripting
Collaborate with DevOps and development teams to embed reliability gates into Azure DevOps pipelines, including automated performance testing, synthetic monitoring, and progressive deployment (canary/blue-green) strategies
Manage the reliability of AKS clusters across multiple Azure regions; own node pool scaling, upgrade strategy, and cluster hardening in alignment with CIS Benchmarks
Contribute to infrastructure-as-code reliability reviews using Terraform/Bicep to enforce standards across Azure Landing Zones

Requirements:

7+ years of experience in SRE, platform engineering, or cloud infrastructure engineering in large-scale enterprise environments (10,000+ employees or equivalent complexity)
Deep, hands-on expertise with Microsoft Azure — minimum 4 years in a primary Azure cloud engineering role
Expert-level proficiency with AKS: cluster lifecycle management, RBAC, network policies, pod security standards, cluster autoscaler, and Workload Identity
Strong infrastructure-as-code skills: Terraform (required) and/or Bicep; experience managing Azure Landing Zones or Enterprise-Scale architecture
Proficiency in at least one systems programming/scripting language: Python (preferred), Go, or PowerShell
Experience designing and operating enterprise observability platforms using Azure Monitor, Log Analytics and Application Insights at scale
Demonstrable track record of owning SLOs/SLIs and delivering measurable reliability improvements in production
Strong knowledge of enterprise networking in Azure: Hub-and-Spoke/Virtual WAN, ExpressRoute, Azure Firewall, NSGs, Private Endpoints, and DNS Private Zones

Certifications:

AZ-104
AZ-305
AZ-400
CKA
ITIL v4 Foundation
