SmarterDx is a clinical AI company transforming how hospitals translate care into payment. They are seeking a Staff Site Reliability Engineer to lead the reliability, scalability, and operational excellence of their production systems, implementing SRE practices and enhancing system observability.

Responsibilities:

Define and evolve reliability standards for the SmarterDx platform, including SLIs, SLOs, and error budgets that align engineering work with customer impact
Implement a “reliability” platform using Terraform and infrastructure-as-code best practices
Enhance observability systems (metrics, logs, traces, alerting) to provide actionable insights and reduce mean time to detect (MTTD) and mean time to resolve (MTTR)
Lead incident response, drive blameless postmortems, and implement systemic improvements to prevent recurrence
Reduce operational toil through automation, self-healing systems, and improved deployment and rollback mechanisms
Provide production support for the SmarterDx platform, applying SRE principles to ensure availability, performance, and data durability
Research, prototype, and advocate for new reliability practices, tooling, and architectural improvements across the engineering organization

Requirements:

10+ years of experience in software engineering and software reliability engineering, with significant time spent operating and scaling distributed systems in production environments
3+ years of hands-on experience running cloud-native infrastructure in AWS, including deep familiarity with containers, Kubernetes, monitoring, and alerting in live production systems
Proven experience defining and managing SLIs/SLOs, leading incident response, and driving postmortems and systemic reliability improvements
Strong expertise with Terraform and infrastructure-as-code practices for managing production infrastructure safely and reproducibly
Deep experience with Kubernetes architecture and operations, including workload reliability, cluster scaling, networking, and failure modes
Experience working in security-conscious, compliance-oriented environments where reliability and data protection are first-class concerns
Bachelor's or Master's degree in Computer Science, Engineering, or a related field — or equivalent practical experience operating large-scale systems
Reliability engineering experience with production database systems (e.g. Postgres)
