Clearpath is seeking a Principal Site Reliability Engineer with deep systems-level expertise and a proven track record in production healthcare environments. The role serves as the highest escalation point for production incidents and as the primary technical interface for complex customer deployments, so it requires both strong technical skills and clear communication.

Responsibilities:

Serve as the senior escalation point for critical production incidents across cloud-hosted and customer-premise deployments, owning issues from initial triage through root cause identification and resolution
Perform advanced live-system diagnostics on Linux hosts and Docker/ECS container environments, including log aggregation, process inspection, resource contention analysis, and crash dump review
Analyze Laravel/PHP, C#, and Rust application behavior in production: parsing structured and unstructured application logs, reading exception stack traces, diagnosing session and cache failures (Redis/ElastiCache), and identifying OOM conditions, deadlocks, or misconfigured queue workers
Investigate and resolve AWS infrastructure-layer issues spanning EC2 instances, ECS task and service health, SQS message backlog and DLQ accumulation, SNS delivery failures, S3 access and policy errors, Aurora connection pool exhaustion, and CloudWatch alarm and metric anomalies
Conduct in-depth container-level debugging: inspecting Docker layer builds, ECS task definition misconfigurations, networking between tasks via Cloud Map or ALB target groups, and environment variable or secrets injection failures from Secrets Manager
Use CloudWatch Logs Insights, Metrics, and X-Ray (or equivalent) to correlate distributed system failures across service boundaries and identify latency outliers, error rate spikes, and cascading failure patterns
Diagnose and resolve failures in DICOM workflows, including C-STORE, C-FIND, C-MOVE, and C-GET operations and DICOMweb (WADO-RS, STOW-RS, QIDO-RS) endpoints, using tools such as DCMTK utilities (dcmdump, storescu, findscu) and Wireshark/tcpdump packet captures
Troubleshoot DICOM association negotiation failures, transfer syntax mismatches, SOP class rejection, and modality connectivity issues across multi-site PACS and VNA deployments
Analyze HL7 v2 message flows (ADT, ORM, ORU, MDM) through integration engines and custom adapters, identifying parsing errors, field mapping failures, segment ordering issues, and acknowledgment (ACK/NACK) problems
Collaborate with clinical informatics and integration teams at customer sites to resolve interoperability issues between modalities, RIS, EHR, and the platform's imaging exchange infrastructure
Lead technical engagement for complex multi-site deployment issues spanning customer on-premise networks
Diagnose network-layer issues affecting DICOM connectivity: port accessibility, firewall rule conflicts, MTU mismatch, TLS certificate errors, and proxy interference with DICOMweb or HL7 MLLP traffic
Engage directly with customer IT and networking teams to coordinate resolution of infrastructure-side issues, translating complex platform requirements into actionable guidance for non-specialist audiences
Document multi-site deployment architectures, network topology dependencies, and known issue patterns in Confluence to build institutional knowledge and accelerate future incident resolution
Author detailed post-incident reports (PIRs) with timeline reconstruction, root cause analysis, contributing factors, and corrective action items, distributing findings to engineering, product, and customer stakeholders
Build and maintain runbooks, diagnostic playbooks, and escalation decision trees in Confluence for common failure categories, enabling support and customer success teams to handle a larger share of incidents independently
Partner with engineering teams to surface systemic issues discovered through support patterns, advocating for observability improvements, defensive coding practices, and configuration guardrails
Define and track support SLA metrics including MTTR, escalation rate, and repeat incident frequency, reporting trends to leadership and recommending operational improvements
Mentor and technically upskill support engineers and customer success managers on platform architecture, DICOM/HL7 fundamentals, and AWS infrastructure concepts

Requirements:

8+ years of professional experience in a senior technical support, site reliability, or platform operations role with direct production system responsibility
Expert-level Linux administration: process management, filesystem operations, network diagnostics (netstat, ss, tcpdump, curl, dig, nmap), systemd/journald, cron, and shell scripting (bash)
Advanced Docker expertise including image inspection, container runtime debugging, volume and network configuration, multi-stage builds, and log collection from running or exited containers
Hands-on AWS operational experience across EC2, ECS (Fargate and EC2 launch types), SQS, SNS, S3, Aurora PostgreSQL, ElastiCache, CloudWatch Logs and Metrics, Secrets Manager, IAM, VPC, ALB, and Route 53
Demonstrated proficiency troubleshooting DICOM protocols at the association and message level, with practical experience using DCMTK utilities or equivalent diagnostic tools
Working knowledge of HL7 v2 message structure, segment definitions, and integration engine behavior sufficient to parse and diagnose message-level failures without vendor assistance
Strong PostgreSQL/Aurora operational skills: query analysis with EXPLAIN, connection pool monitoring, slow query identification, replication lag assessment, and schema-level investigation
Experience diagnosing complex, multi-hop network issues in hybrid cloud and on-premise environments, including VPN, DNS, TLS, and firewall troubleshooting
Excellent written and verbal communication skills, with demonstrated ability to produce both executive-ready incident summaries and technically precise root cause analyses
Experience supporting DICOMweb-based imaging exchange platforms, VNA infrastructure, or PACS/RIS integrations in a multi-site healthcare environment
Production experience supporting Laravel applications: understanding of Artisan commands, queue workers, Horizon, scheduled tasks, .env configuration, storage and cache layers, and common failure modes in containerized Laravel deployments
AWS certifications such as SysOps Administrator, Solutions Architect, or Advanced Networking Specialty
Proficiency with CloudWatch Logs Insights query syntax and experience building operational dashboards and composite alarms for production monitoring
Exposure to OpenTofu or Terraform for reading and understanding infrastructure definitions relevant to incident investigation
Experience with Wireshark, tcpdump, or similar packet analysis tools for DICOM DIMSE and HL7 MLLP traffic capture and inspection
Familiarity with the Atlassian Suite
Background supporting regulated healthcare software
