Build and own production-grade cloud infrastructure
Design auto-scaling infrastructure that absorbs sharp traffic peaks without manual intervention
Own the full observability stack: metrics, logs, traces, SLI dashboards, and SLO-based alerting
Build and own the incident response process: post-mortems, severity SLAs, on-call rotation for peak events, and systematic MTTR reduction
Harden the platform through security best practices: least-privilege IAM, secrets management, CI/CD scanning, and compliance posture (SOC 2)
Build golden-path IaC templates so engineers can self-serve environments in minutes, and improve CI/CD speed

You have spent meaningful time (5+ years) as a hands-on SRE, DevOps, or Platform Engineer, with production experience in AWS or GCP, Infrastructure as Code tooling (Terraform, Pulumi, or equivalent), database scaling (Postgres or equivalent), and multi-region infrastructure management
You love building robust, scalable systems. Reliability and performance are your craft
You have significant expertise scaling cloud platforms. You have already operated infrastructure that handled large, unpredictable traffic spikes
You are passionate about entertainment: enabling 150k+ people to get into a show without a hitch is your kind of challenge
You communicate clearly with the rest of the tech team. You see developer experience as a core part of your job and take pride in empowering engineers to ship faster and with confidence
You are curious and keep a close watch on cloud infrastructure, reliability practices, and developer tooling
You have a strong sense of ownership

Site Reliability Engineer – DevOps

Key skills