Clearpath is seeking a Principal Site Reliability Engineer with deep systems-level expertise and a proven track record in production healthcare environments. The role serves as the highest escalation point for production incidents and as the primary technical interface for complex customer deployments, and it calls for both deep technical skill and clear communication.
Responsibilities:
- Serve as the senior escalation point for critical production incidents across cloud-hosted and customer-premise deployments, owning issues from initial triage through root cause identification and resolution
- Perform advanced live-system diagnostics on Linux hosts and Docker/ECS container environments, including log aggregation, process inspection, resource contention analysis, and crash dump review
- Analyze Laravel/PHP, C#, and Rust application behavior in production: parsing structured and unstructured application logs, tracing exception stack traces, diagnosing session and cache failures (Redis/ElastiCache), and identifying OOM conditions, deadlocks, or misconfigured queue workers
- Investigate and resolve AWS infrastructure-layer issues spanning EC2 instances, ECS task and service health, SQS message backlog and DLQ accumulation, SNS delivery failures, S3 access and policy errors, Aurora connection pool exhaustion, and CloudWatch alarm and metric anomalies
- Conduct in-depth container-level debugging: inspecting Docker layer builds, ECS task definition misconfigurations, networking between tasks via Cloud Map or ALB target groups, and environment variable or secrets injection failures from Secrets Manager
- Use CloudWatch Logs Insights, Metrics, and X-Ray (or equivalent) to correlate distributed system failures across service boundaries and identify latency outliers, error rate spikes, and cascading failure patterns
- Diagnose and resolve failures in DICOM workflows, including C-STORE, C-FIND, C-MOVE, and C-GET operations and DICOMweb (WADO-RS, STOW-RS, QIDO-RS) endpoints, using tools such as the DCMTK utilities (dcmdump, storescu, findscu) and Wireshark/tcpdump packet captures
- Troubleshoot DICOM association negotiation failures, transfer syntax mismatches, SOP class rejection, and modality connectivity issues across multi-site PACS and VNA deployments
- Analyze HL7 v2 message flows (ADT, ORM, ORU, MDM) through integration engines and custom adapters, identifying parsing errors, field mapping failures, segment ordering issues, and acknowledgment (ACK/NACK) problems
- Collaborate with clinical informatics and integration teams at customer sites to resolve interoperability issues between modalities, RIS, EHR, and the platform's imaging exchange infrastructure
- Lead technical engagement for complex multi-site deployment issues spanning customer on-premise networks
- Diagnose network-layer issues affecting DICOM connectivity: port accessibility, firewall rule conflicts, MTU mismatches, TLS certificate errors, and proxy interference with DICOMweb or HL7 MLLP traffic
- Engage directly with customer IT and networking teams to coordinate resolution of infrastructure-side issues, translating complex platform requirements into actionable guidance for non-specialist audiences
- Document multi-site deployment architectures, network topology dependencies, and known issue patterns in Confluence to build institutional knowledge and accelerate future incident resolution
- Author detailed post-incident reports (PIRs) with timeline reconstruction, root cause analysis, contributing factors, and corrective action items, distributing findings to engineering, product, and customer stakeholders
- Build and maintain runbooks, diagnostic playbooks, and escalation decision trees in Confluence for common failure categories, enabling support and customer success teams to handle a larger share of incidents independently
- Partner with engineering teams to surface systemic issues discovered through support patterns, advocating for observability improvements, defensive coding practices, and configuration guardrails
- Define and track support SLA metrics including MTTR, escalation rate, and repeat incident frequency, reporting trends to leadership and recommending operational improvements
- Mentor and technically upskill support engineers and customer success managers on platform architecture, DICOM/HL7 fundamentals, and AWS infrastructure concepts
Requirements:
- 8+ years of professional experience in a senior technical support, site reliability, or platform operations role with direct production system responsibility
- Expert-level Linux administration: process management, filesystem operations, network diagnostics (netstat, ss, tcpdump, curl, dig, nmap), systemd/journald, cron, and shell scripting (bash)
- Advanced Docker expertise including image inspection, container runtime debugging, volume and network configuration, multi-stage builds, and log collection from running or exited containers
- Hands-on AWS operational experience across EC2, ECS (Fargate and EC2 launch types), SQS, SNS, S3, Aurora PostgreSQL, ElastiCache, CloudWatch Logs and Metrics, Secrets Manager, IAM, VPC, ALB, and Route 53
- Demonstrated proficiency troubleshooting DICOM protocols at the association and message level, with practical experience using DCMTK utilities or equivalent diagnostic tools
- Working knowledge of HL7 v2 message structure, segment definitions, and integration engine behavior sufficient to parse and diagnose message-level failures without vendor assistance
- Strong PostgreSQL/Aurora operational skills: query analysis with EXPLAIN, connection pool monitoring, slow query identification, replication lag assessment, and schema-level investigation
- Experience diagnosing complex, multi-hop network issues in hybrid cloud and on-premise environments, including VPN, DNS, TLS, and firewall troubleshooting
- Excellent written and verbal communication skills, with demonstrated ability to produce both executive-ready incident summaries and technically precise root cause analyses
- Experience supporting DICOMweb-based imaging exchange platforms, VNA infrastructure, or PACS/RIS integrations in a multi-site healthcare environment
- Production experience supporting Laravel applications: understanding of Artisan commands, queue workers, Horizon, scheduled tasks, .env configuration, storage and cache layers, and common failure modes in containerized Laravel deployments
- AWS certifications such as SysOps Administrator, Solutions Architect, or Advanced Networking Specialty
- Proficiency with CloudWatch Logs Insights query syntax and experience building operational dashboards and composite alarms for production monitoring
- Exposure to OpenTofu or Terraform for reading and understanding infrastructure definitions relevant to incident investigation
- Experience with Wireshark, tcpdump, or similar packet analysis tools for DICOM DIMSE and HL7 MLLP traffic capture and inspection
- Familiarity with the Atlassian suite, particularly Confluence
- Background supporting regulated healthcare software