Alteryx is a leading analytics company focused on transforming how businesses leverage data, automation, and AI. They are seeking a Lead Site Reliability Engineer to own reliability outcomes for a modern multi-region SaaS platform, focusing on system design, reliability strategy, and cross-team execution.
Responsibilities:
- Define and drive reliability strategy across control-plane and data-plane systems, including multi-region resilience, business continuity and disaster recovery (BCDR), and failover design
- Establish and operationalize SLOs, SLAs, and error budgets, ensuring they inform planning and engineering tradeoffs
- Lead initiatives that measurably improve MTTR, incident prevention, and overall service health
- Own incident management end-to-end, driving systemic fixes and long-term reliability improvements beyond immediate response
- Lead architecture and design reviews to ensure systems meet scalability, reliability, and cost efficiency goals
- Champion automation and modernization, including AI-driven reliability improvements
- Establish and enforce code quality and review standards
- Lead cross-functional initiatives and align engineering with product priorities
- Mentor senior engineers and act as a technical leader across teams
Requirements:
- 6+ years leading delivery of complex, distributed systems or SaaS platforms
- Strong experience with multi-region, split-plane architectures (control-plane / data-plane)
- Proven track record improving SLOs, MTTR, and system reliability at scale
- Proficiency in at least one language such as Python, Java, C++, or JavaScript
- Deep experience with multi-cluster Kubernetes, CI/CD, and GitOps (Argo CD)
- Experience with SLO/SLA design, observability, and incident management
- Experience with Infrastructure as Code and cloud platforms
- Knowledge of disaster recovery, resilience, and security best practices
- Strong leadership skills with experience mentoring senior engineers and influencing cross-team decisions
- Experience with chaos engineering and large-scale reliability automation
- Background in enterprise SaaS platforms or split-plane architectures
- Expertise in navigating, understanding, and leveraging modern observability platforms (Datadog, Grafana, etc.)