Commence focuses on data-centric transformation in healthcare, aiming to improve health outcomes through more efficient processes. The company is seeking a Senior Site Reliability Engineer to ensure the reliability and operational health of its healthcare data platform, collaborating with engineering teams and managing incident response.

Responsibilities:

Design, implement, and own observability infrastructure including metrics, logging, tracing, and alerting across distributed systems
Define and enforce SLOs, SLIs, and error budgets in partnership with product and engineering teams
Lead incident response: triage, coordinate remediation, conduct blameless post-mortems, and drive systemic fixes
Build and maintain CI/CD pipelines that support rapid, safe delivery of changes to production
Collaborate with engineering teams on infrastructure changes, including reading, modifying, and contributing to existing infrastructure-as-code (Terraform or CloudFormation)
Design and operate highly available, fault-tolerant systems—including auto-scaling, failover, and disaster recovery strategies
Reduce operational toil through automation; eliminate manual processes before they become habits
Collaborate with software engineers to establish reliability-first design patterns and review architectures for operational risk
Manage Kubernetes or container orchestration environments at scale
Ensure systems meet compliance and security requirements, particularly those applicable to healthcare data (HIPAA, SOC 2)
Provide technical mentorship and guidance to engineers across the organization on reliability practices
Participate in the on-call rotation with a commitment to continuously reducing the need for it

Requirements:

7+ years of experience in SRE, platform engineering, or DevOps roles
Exceptional problem-solving under pressure—demonstrated track record of diagnosing complex, high-stakes system failures and building durable solutions
Deep hands-on experience with AWS services including EC2, EKS/ECS, Lambda, RDS, S3, CloudWatch, and related tooling
Familiarity with infrastructure-as-code (Terraform or CloudFormation)—able to contribute to existing configurations
Experience designing and operating distributed systems with strict availability and latency requirements
Proficiency in at least one scripting or systems language (Python, Go, Bash, or similar) for automation and tooling
Experience with container orchestration (Kubernetes, ECS) in production environments
Expertise in observability tooling (OpenSearch, Prometheus/Grafana, or equivalent)
Hands-on experience with CI/CD platforms (GitHub Actions, Jenkins, CircleCI, or similar)
Proven ability to define and operationalize SLOs and error budgets
Experience with relational and NoSQL databases—performance tuning, replication, and backup strategies
Strong working knowledge of networking fundamentals: DNS, load balancing, VPCs, TLS
Excellent communication skills—able to translate technical risk into business impact for non-engineering stakeholders

Preferred qualifications:

AWS Certifications (Solutions Architect, DevOps Engineer, or SysOps Administrator)
Experience in healthcare technology or other regulated industries (HIPAA, SOC 2, FedRAMP)
Familiarity with chaos engineering practices and tooling
Experience with data pipeline reliability (ETL/ELT workflows, streaming systems)
Exposure to AI/ML infrastructure and the reliability challenges unique to model serving
Familiarity with additional cloud platforms (Azure, Google Cloud)
Contributions to open-source reliability or infrastructure tooling
