Fabric Health is a company focused on enhancing healthcare delivery through intelligent automation. The Senior Site Reliability Engineer will be responsible for the infrastructure that supports healthcare experiences, ensuring resilience, scalability, and compliance while exploring AI-driven operations.
Responsibilities:
- Designing, deploying, and maintaining production Kubernetes (EKS) clusters to ensure enterprise-grade availability for our users
- Eliminating manual configuration by defining and managing all infrastructure state in Terraform
- Optimizing the AWS footprint (specifically EC2, RDS, and S3) to balance high performance with cost-efficiency and reliability
- Exploring and deploying agentic workflows for AI-assisted runbooks that automate complex operational decisions and repetitive tasks
- Building and evolving deployment pipelines using GitHub Actions or Semaphore to ensure delivery is both rapid and safe
- Reducing toil by developing internal tools that replace manual operational work with intelligent, autonomous systems
- Driving the evolution of the observability stack in Datadog by implementing the metrics, traces, and logs needed to measure and meet SLOs
- Leading incident response efforts and facilitating blameless postmortems that systematically reduce mean time to recovery (MTTR)
- Defining and monitoring the SLIs and SLOs that ensure the platform consistently meets rigorous healthcare performance standards
- Ensuring every piece of infrastructure remains fully compliant with HIPAA and other critical healthcare regulatory requirements
- Mentoring engineers across the company on reliability best practices and contributing a clinical-safety perspective to cross-functional design reviews
Requirements:
- 5+ years of experience in SRE, DevOps, or Platform roles managing production environments at scale
- Expert-level knowledge of AWS (EKS, EC2, RDS, S3) and production-grade Kubernetes management
- Proficiency with modern tooling including Terraform (IaC), Datadog (Observability), and CI/CD systems
- Strong coding and scripting skills in Python, Bash, Ruby, or Go
- A 'rigor-first' mindset with a dedication to HIPAA-compliant, high-availability architecture
- Preferred: experience building agentic workflows or AI-assisted tooling to drive operational efficiency