Fabric Health is dedicated to solving healthcare's biggest challenge: clinical capacity. As a Senior Site Reliability Engineer, you will own and evolve the infrastructure that powers healthcare experiences for millions, ensuring platform resilience, scalability, and compliance while integrating AI-driven operations.
Responsibilities:
- Designing, deploying, and maintaining production Kubernetes (EKS) clusters to ensure enterprise-grade availability for our users
- Eliminating manual configuration by defining and managing all infrastructure as code in Terraform
- Optimizing the AWS footprint (specifically EC2, RDS, and S3) to balance high performance with cost-efficiency and reliability
- Exploring and deploying agentic workflows for AI-assisted runbooks that automate complex operational decisions and repetitive tasks
- Building and evolving deployment pipelines using GitHub Actions or Semaphore to ensure delivery is both rapid and safe
- Reducing toil by developing internal tools that replace manual operational work with intelligent, autonomous systems
- Driving the evolution of the observability stack in Datadog by implementing the metrics, traces, and logs needed to track and meet SLOs
- Leading incident response efforts and facilitating the blameless postmortems that help systematically reduce mean time to recovery (MTTR)
- Defining and monitoring the SLIs and SLOs that ensure the platform consistently meets rigorous healthcare performance standards
- Ensuring every piece of infrastructure remains fully compliant with HIPAA and other critical healthcare regulatory requirements
- Mentoring engineers across the company on reliability best practices and contributing a clinical-safety perspective to cross-functional design reviews
Requirements:
- 5+ years of experience in SRE, DevOps, or Platform roles managing production environments at scale
- Expert-level depth in AWS (EKS, EC2, RDS, S3) and production-grade Kubernetes management
- Proficiency with modern tooling including Terraform (IaC), Datadog (Observability), and CI/CD systems
- Strong coding and scripting skills in Python, Bash, Ruby, or Go
- A 'rigor-first' mindset with a dedication to HIPAA-compliant, high-availability architecture
- Preferred: experience building agentic workflows or AI-assisted tooling to drive operational efficiency