Own SLOs, SLIs, and error budgets for all production services; drive error budget discipline across engineering
Design reliability patterns for AI agent pipelines: LLM observability, tool-use tracking, failure detection, and graceful degradation
Architect for blast radius containment — agent failures must have bounded customer impact through isolation, circuit breaking, and rapid recovery
Mature our Canada Central/West active-active architecture toward a 24-hour RTO with full regional failover
Lead incident response and post-incident reviews that produce durable fixes; maintain DR procedures through regular testing
Serve as the primary reliability liaison to Software and AI Engineering, translating reliability requirements into actionable standards
Own CI/CD pipeline strategy (Bitbucket Pipelines, GitHub Actions) — set standards, optimize deployment frequency, and ensure teams can ship confidently
Represent reliability in architectural discussions; surface risk before it's committed to design
Own the service catalog — a living inventory of all services, AI agents, dependencies, ownership, and SLOs
Operate Datadog as the single pane of glass for service health, infrastructure, and agentic pipeline telemetry
Build AI-assisted automation to progressively reduce toil and scale the team's operational capacity
Requirements
Bachelor's degree in Computer Science, Engineering, or equivalent combination of education and experience
6–8 years of progressive experience in site reliability engineering, platform engineering, or DevOps, with demonstrated technical leadership at the senior individual contributor level
Deep expertise in AWS (EKS, Lambda, CloudWatch, AWS Config) and multi-region architecture patterns
Proficiency with Terraform and GitOps; experience with policy-as-code (Sentinel, OPA/Rego, or equivalent)
Proficiency in Python and/or Bash, with experience building operational tooling
Solid understanding of Java and Spring Boot microservice architecture, sufficient to make reliability and deployment decisions for EKS-hosted services
Deep expertise in CI/CD pipeline design and optimization using Bitbucket Pipelines and GitHub Actions
Familiarity with internal developer platform (IDP) tooling (Backstage, Atlassian Compass, or equivalent) strongly preferred
Experience with AI/ML workload infrastructure, LLM API integration, or agentic system operations considered a strong asset
Tech Stack
AWS
Docker
Java
Kubernetes
Python
Spring
Spring Boot
Terraform
Benefits
Company-sponsored training and development opportunities