TechInsights is the information platform for the semiconductor industry, providing in-depth intelligence and analysis to over 650 companies. They are seeking a Senior Site Reliability Engineer to lead strategic reliability initiatives and build the foundation for their AI-first platform, focusing on observability, incident response, and mentoring engineering teams.

Responsibilities:

Own SLOs, SLIs, and error budgets for all production services; drive error budget discipline across engineering
Design reliability patterns for AI agent pipelines: LLM observability, tool-use tracking, failure detection, and graceful degradation
Architect for blast radius containment — agent failures must have bounded customer impact through isolation, circuit breaking, and rapid recovery
Mature our Canada Central/West active-active architecture toward a 24-hour RTO with full regional failover
Lead incident response and post-incident reviews that produce durable fixes; maintain DR procedures through regular testing
Serve as the primary reliability liaison to Software and AI Engineering, translating requirements into actionable standards
Partner with AI Engineering on compute provisioning, model serving, inference latency, and workload isolation
Own CI/CD pipeline strategy (Bitbucket Pipelines, GitHub Actions) — set standards, optimize deployment frequency, and ensure teams can ship confidently
Drive IDP adoption and enable teams on SRE practices: on-call readiness, SLO definition, runbook development, and self-service tooling
Represent reliability in architectural discussions; surface risk before it's committed to design
Own the service catalog — a living inventory of all services, AI agents, dependencies, ownership, and SLOs
Operate Datadog as the single pane of glass for service health, infrastructure, and agentic pipeline telemetry
Extend observability to AI workloads: LLM latency, token consumption, agent completion rates, and pipeline throughput
Build golden path templates in Backstage and/or Atlassian Compass so teams ship reliably without routine SRE involvement
Apply AIOps in Datadog to automate anomaly detection, incident triage, and remediation recommendations
Own infrastructure as code via Terraform and GitOps; enforce IaC policy in partnership with Trust Assurance
Own FinOps visibility into AWS cost segments; model cloud cost impact as AI/ML workloads scale
Formally mentor junior and intermediate SREs, with accountability for their technical growth and career progression
Build AI-assisted automation to progressively reduce toil and scale the team's operational capacity

Requirements:

Bachelor's degree in Computer Science, Engineering, or equivalent combination of education and experience
6–8 years of progressive experience in site reliability engineering, platform engineering, or DevOps, with demonstrated technical leadership at the senior individual contributor level
Deep expertise in AWS (EKS, Lambda, CloudWatch, AWS Config) and multi-region architecture patterns
Proficiency with Terraform and GitOps; experience with policy-as-code (Sentinel, OPA/Rego, or equivalent)
Hands-on Datadog experience at operational depth: dashboards, SLO tracking, alerting, log management, distributed tracing
Strong containerization expertise: Docker, Kubernetes (EKS preferred)
Proficiency in Python and/or Bash; experience building operational tooling
Solid understanding of Java and Spring Boot microservice architecture, sufficient to make reliability and deployment decisions for EKS-hosted services
Deep expertise in CI/CD pipeline design and optimization using Bitbucket Pipelines and GitHub Actions
Familiarity with IDP tooling (Backstage, Atlassian Compass, or equivalent) strongly preferred
Experience with AI/ML workload infrastructure, LLM API integration, or agentic system operations considered a strong asset
Leads and owns strategic reliability initiatives end-to-end with a high degree of autonomy; accountable for outcomes, not just tasks
Sets technical direction and influences team and department strategy
Solves complex, ambiguous reliability problems through systematic analysis and first-principles thinking
Formally mentors junior and intermediate engineers; builds team capability through coaching and knowledge transfer
Communicates technical reliability concepts clearly to engineering, product, and leadership audiences
Approaches operational work with an AI-first posture: builds automation and intelligent tooling as the default
Experience designing reliability architecture for agentic AI systems: agent loop observability, blast radius isolation, graceful degradation for LLM-dependent services
AWS certifications: Solutions Architect Professional, DevOps Engineer Professional, or equivalent
FinOps Certified Practitioner or demonstrated cloud cost management experience at scale
IDP implementation or developer experience program leadership
Experience in semiconductor, SaaS, or data-intensive platform environments
Experience operating in environments with export-controlled or regulated data
Knowledge of BCP/DR program management and formal recovery testing

Senior Site Reliability Engineer (Remote Canada)
