Fabric Health is dedicated to solving healthcare's biggest challenge: clinical capacity. As a Senior Site Reliability Engineer, you will own and evolve the infrastructure that powers healthcare experiences for millions, ensuring platform resilience, scalability, and compliance while integrating AI-driven operations.

Responsibilities:

Designing, deploying, and maintaining production Kubernetes (EKS) clusters to ensure enterprise-grade availability for our users
Eliminating manual configuration by defining and managing all infrastructure state through Terraform
Optimizing the AWS footprint (specifically EC2, RDS, and S3) to balance performance, cost-efficiency, and reliability
Exploring and deploying agentic workflows for AI-assisted runbooks that automate complex operational decisions and repetitive tasks
Building and evolving deployment pipelines using GitHub Actions or Semaphore to ensure delivery is both rapid and safe
Focusing on toil reduction by developing internal tools that replace manual operational work with intelligent, autonomous systems
Driving the evolution of the observability stack in Datadog by implementing the metrics, traces, and logs needed to track and meet SLOs
Leading incident response efforts and facilitating blameless postmortems that systematically reduce mean time to recovery (MTTR)
Defining and monitoring the SLIs and SLOs that ensure the platform consistently meets rigorous healthcare performance standards
Ensuring every piece of infrastructure remains fully compliant with HIPAA and other critical healthcare regulatory requirements
Mentoring engineers across the company on reliability best practices and contributing a clinical-safety perspective to cross-functional design reviews

Requirements:

5+ years of experience in SRE, DevOps, or Platform roles managing production environments at scale
Expert technical depth in AWS (EKS, EC2, RDS, S3) and production-grade Kubernetes management
Proficiency with modern tooling including Terraform (IaC), Datadog (Observability), and CI/CD systems
Strong coding and scripting skills in Python, Bash, Ruby, or Go
A 'rigor-first' mindset with a dedication to HIPAA-compliant, high-availability architecture
Preferred: experience building agentic workflows or AI-assisted tooling to drive operational efficiency
