TechInsights is the information platform for the semiconductor industry, providing in-depth intelligence and analysis to over 650 companies. The company is seeking a Senior Site Reliability Engineer to lead strategic reliability initiatives and build the foundation for its AI-first platform, focusing on observability, incident response, and mentoring engineering teams.
Responsibilities:
- Own SLOs, SLIs, and error budgets for all production services; drive error budget discipline across engineering
- Design reliability patterns for AI agent pipelines: LLM observability, tool-use tracking, failure detection, and graceful degradation
- Architect for blast radius containment — agent failures must have bounded customer impact through isolation, circuit breaking, and rapid recovery
- Mature our Canada Central/West active-active architecture toward 24-hour RTO with full regional failover
- Lead incident response and post-incident reviews that produce durable fixes; maintain DR procedures through regular testing
- Serve as the primary reliability liaison to Software and AI Engineering, translating requirements into actionable standards
- Partner with AI Engineering on compute provisioning, model serving, inference latency, and workload isolation
- Own CI/CD pipeline strategy (Bitbucket Pipelines, GitHub Actions) — set standards, optimize deployment frequency, and ensure teams can ship confidently
- Drive IDP adoption and enable teams on SRE practices: on-call readiness, SLO definition, runbook development, and self-service tooling
- Represent reliability in architectural discussions; surface risk before it's committed to design
- Own the service catalog — a living inventory of all services, AI agents, dependencies, ownership, and SLOs
- Operate Datadog as the single pane of glass for service health, infrastructure, and agentic pipeline telemetry
- Extend observability to AI workloads: LLM latency, token consumption, agent completion rates, and pipeline throughput
- Build golden path templates in Backstage and/or Atlassian Compass so teams ship reliably without routine SRE involvement
- Apply AIOps in Datadog to automate anomaly detection, incident triage, and remediation recommendations
- Own infrastructure as code via Terraform and GitOps; enforce IaC policy in partnership with Trust Assurance
- Own FinOps visibility into AWS cost segments; model cloud cost impact as AI/ML workloads scale
- Formally mentor junior and intermediate SREs, with accountability for their technical growth and career progression
- Build AI-assisted automation to progressively reduce toil and scale the team's operational capacity
Requirements:
- Bachelor's degree in Computer Science, Engineering, or equivalent combination of education and experience
- 6–8 years of progressive experience in site reliability engineering, platform engineering, or DevOps, with demonstrated technical leadership at the senior individual contributor level
- Deep expertise in AWS (EKS, Lambda, CloudWatch, AWS Config) and multi-region architecture patterns
- Proficiency with Terraform and GitOps; experience with policy-as-code (Sentinel, OPA/Rego, or equivalent)
- Hands-on Datadog experience at operational depth: dashboards, SLO tracking, alerting, log management, distributed tracing
- Strong containerization expertise: Docker, Kubernetes (EKS preferred)
- Proficiency in Python and/or Bash; experience building operational tooling; solid understanding of Java and Spring Boot microservice architecture sufficient to make reliability and deployment decisions for EKS-hosted services
- Deep expertise in CI/CD pipeline design and optimization using Bitbucket Pipelines and GitHub Actions
- Familiarity with IDP tooling (Backstage, Atlassian Compass, or equivalent) strongly preferred
- Experience with AI/ML workload infrastructure, LLM API integration, or agentic system operations considered a strong asset
- Leads and owns strategic reliability initiatives end-to-end with a high degree of autonomy; accountable for outcomes, not just tasks
- Sets technical direction and influences team and department strategy
- Solves complex, ambiguous reliability problems through systematic analysis and first-principles thinking
- Formally mentors junior and intermediate engineers; builds team capability through coaching and knowledge transfer
- Communicates technical reliability concepts clearly to engineering, product, and leadership audiences
- Approaches operational work with an AI-first posture: builds automation and intelligent tooling as the default
- Experience designing reliability architecture for agentic AI systems: agent loop observability, blast radius isolation, graceful degradation for LLM-dependent services
- AWS certifications: Solutions Architect Professional, DevOps Engineer Professional, or equivalent
- FinOps Certified Practitioner or demonstrated cloud cost management experience at scale
- IDP implementation or developer experience program leadership
- Experience in semiconductor, SaaS, or data-intensive platform environments
- Experience operating in environments with export-controlled or regulated data
- Knowledge of BCP/DR program management and formal recovery testing