SmarterDx is a clinical AI company transforming how hospitals translate care into payment. They are seeking a Staff Site Reliability Engineer to lead the reliability, scalability, and operational excellence of their production systems, implementing SRE practices and enhancing system observability.
Responsibilities:
- Define and evolve reliability standards for the SmarterDx platform, including SLIs, SLOs, and error budgets that align engineering work with customer impact
- Build a reliability platform using Terraform and infrastructure-as-code best practices
- Enhance observability systems (metrics, logs, traces, alerting) to provide actionable insights and reduce mean time to detect (MTTD) and mean time to resolve (MTTR)
- Lead incident response, drive blameless postmortems, and implement systemic improvements to prevent recurrence
- Reduce operational toil through automation, self-healing systems, and improved deployment and rollback mechanisms
- Provide production support for the SmarterDx platform, applying SRE principles to ensure availability, performance, and data durability
- Research, prototype, and advocate for new reliability practices, tooling, and architectural improvements across the engineering organization
Requirements:
- 10+ years of experience in software engineering and software reliability engineering, with significant time spent operating and scaling distributed systems in production environments
- 3+ years of hands-on experience running cloud-native infrastructure in AWS, including deep familiarity with containers, Kubernetes, monitoring, and alerting in live production systems
- Proven experience defining and managing SLIs/SLOs, leading incident response, and driving postmortems and systemic reliability improvements
- Strong expertise with Terraform and infrastructure-as-code practices for managing production infrastructure safely and reproducibly
- Deep experience with Kubernetes architecture and operations, including workload reliability, cluster scaling, networking, and failure modes
- Experience working in security-conscious, compliance-oriented environments where reliability and data protection are first-class concerns
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field, or equivalent practical experience operating large-scale systems
- Reliability engineering experience with production database systems (e.g. Postgres)