Expedite Talent Solutions is seeking a Principal SRE to lead reliability engineering strategy and establish enterprise-wide observability, incident management, and reliability governance. The role involves designing SLIs/SLOs, driving automation, and working closely with engineering teams to embed reliability practices across the application lifecycle.
Responsibilities:
- Define and implement SLIs, SLOs, and reliability targets aligned with organizational Golden Pathways
- Build and operationalize observability standards across metrics, logs, and traces
- Establish SRE telemetry ingestion pipelines and reliability engineering workflows
- Design and implement telemetry for:
  - Application Performance Monitoring (APM) – service response times and bottleneck detection
  - Logging & Tracing – correlated logs and distributed tracing
  - Event & Alerting – meaningful alerts tied to severity and actionability
- Build service health dashboards and monitoring pipelines using Grafana
- Evolve incident management practices and RCA frameworks
- Develop automation workflows to improve detection, response, and recovery
- Implement RCA tagging, compliance monitoring, and lifecycle tracking
- Design and build a Central SRE Operating View and Golden Dashboard (Single Pane of Glass)
- Aggregate telemetry including reliability metrics, incident trends, MTTR, RCA themes, alert noise, and resilience indicators across 40+ applications
- Provide executive dashboards for CIO/VP visibility and monthly reliability reviews
- Develop executive scorecards including:
  - Per-application reliability score
  - SRE maturity score
  - MTTD / MTTR / MTTRestore metrics
  - Escalation patterns and failure trends
- Deliver runbooks, telemetry integration guides, and RCA enforcement playbooks
Requirements:
- Lead reliability engineering strategy and establish enterprise-wide observability, incident management, and reliability governance
- Design and implement SLIs/SLOs
- Drive automation and build centralized reliability visibility using Grafana and ServiceNow Performance Analytics
- Work closely with engineering teams to embed reliability practices across the application lifecycle
- Create a Single Pane of Glass for operational and executive insight across 40+ applications