Expedite Talent Solutions is seeking a Principal SRE to lead reliability engineering strategy and establish enterprise-wide observability, incident management, and reliability governance. The role involves designing SLIs/SLOs, driving automation, and working closely with engineering teams to embed reliability practices across the application lifecycle.

Responsibilities:

Define and implement SLIs, SLOs, and reliability targets aligned with organizational Golden Pathways
Build and operationalize observability standards across metrics, logs, and traces
Establish SRE telemetry ingestion pipelines and reliability engineering workflows
Design and implement telemetry for:
  Application Performance Monitoring (APM) – service response times and bottleneck detection
  Logging & Tracing – correlated logs and distributed tracing
  Event & Alerting – meaningful alerts tied to severity and actionability
Build service health dashboards and monitoring pipelines using Grafana
Evolve incident management practices and RCA frameworks
Develop automation workflows to improve detection, response, and recovery
Implement RCA tagging, compliance monitoring, and lifecycle tracking
Design and build a Central SRE Operating View and Golden Dashboard (Single Pane of Glass)
Aggregate telemetry including reliability metrics, incident trends, MTTR, RCA themes, alert noise, and resilience indicators across 40+ applications
Provide executive dashboards for CIO/VP visibility and monthly reliability reviews
Develop executive scorecards including:
  Per-application reliability score
  SRE maturity score
  MTTD / MTTR / MTTRestore metrics
  Escalation patterns and failure trends
Deliver runbooks, telemetry integration guides, and RCA enforcement playbooks

Requirements:

Lead reliability engineering strategy and establish enterprise-wide observability, incident management, and reliability governance
Design and implement SLIs/SLOs
Drive automation and build centralized reliability visibility using Grafana and ServiceNow Performance Analytics
Work closely with engineering teams to embed reliability practices across the application lifecycle
Create a Single Pane of Glass for operational and executive insight across 40+ applications

Site Reliability Engineering Manager/Architect/Principal SRE