Enlyte combines innovative technology, clinical expertise, and human compassion to help people recover after workplace injuries or auto accidents. The Principal Reliability Engineer is a senior technical leader responsible for the reliability, observability, and operational control of enterprise platforms and services. The role leads complex reliability initiatives and partners with engineering, Cloud, Security, and Platform teams to keep systems resilient and operable at scale.

Responsibilities:

Own the reliability control plane, including standards and architecture for monitoring, logging, tracing, alerting, and incident management
Define how services expose health, performance, and operational signals across the enterprise
Establish and evolve reliability patterns and reference architectures adopted across teams
Lead design decisions that improve system resilience, fault tolerance, and recoverability
Own integrations between platforms and reliability tooling (monitoring, alerting, incident response, on-call, and automation systems)
Define consistent approaches to telemetry collection, normalization, and consumption
Ensure observability tooling provides actionable visibility aligned to service-level objectives
Evaluate and recommend tooling improvements that enhance visibility and operational insight
Lead complex reliability initiatives impacting multiple systems or platforms
Partner with engineering teams to design reliable, observable services from inception
Drive adoption of best practices for operational readiness, graceful degradation, and failure handling
Review system designs to ensure reliability and observability requirements are met
Establish standards for alerting quality, escalation, and incident response
Drive improvements in incident detection, diagnosis, and recovery
Lead or support root cause analysis for significant incidents and ensure durable corrective actions
Promote a culture of operational excellence and continuous reliability improvement
Serve as a technical leader and trusted advisor on reliability and observability
Mentor senior engineers and influence reliability practices across teams
Collaborate with Cloud, Security, and Platform leaders to align reliability strategy with business needs

Requirements:

12+ years of related experience and a Bachelor's degree, or equivalent professional experience
Extensive experience designing and operating reliable, observable systems at scale
Proven success owning or leading observability and incident management platforms
Background in cloud, platform, or infrastructure engineering strongly preferred
Deep expertise in reliability engineering, observability, and distributed systems
Strong understanding of monitoring, logging, tracing, alerting, and incident management concepts
Experience integrating and operating reliability tooling at enterprise scale
Solid grasp of cloud and platform architectures and their operational characteristics
Ability to translate operational risk and system behavior into actionable engineering improvements
