Commence is a company focused on data-centric transformation in healthcare. As a Senior Site Reliability Engineer, you will ensure the reliability and operational health of the healthcare data platform, bridging engineering and operations while implementing observability and automation solutions.
Responsibilities:
- Design, implement, and own observability infrastructure including metrics, logging, tracing, and alerting across distributed systems
- Define and enforce SLOs, SLIs, and error budgets in partnership with product and engineering teams
- Lead incident response: triage, coordinate remediation, conduct blameless post-mortems, and drive systemic fixes
- Build and maintain CI/CD pipelines that support rapid, safe delivery of changes to production
- Collaborate with engineering teams on infrastructure changes by reading, modifying, and contributing to existing infrastructure-as-code (Terraform or CloudFormation)
- Design and operate highly available, fault-tolerant systems, including auto-scaling, failover, and disaster recovery strategies
- Reduce operational toil through automation; eliminate manual processes before they become habits
- Collaborate with software engineers to establish reliability-first design patterns and review architectures for operational risk
- Manage Kubernetes or other container orchestration environments at scale
- Ensure systems meet compliance and security requirements, particularly those applicable to healthcare data (HIPAA, SOC 2)
- Provide technical mentorship and guidance to engineers across the organization on reliability practices
- Participate in the on-call rotation, with a commitment to continuously reducing the need for it
Requirements:
- 7+ years of experience in SRE, platform engineering, or DevOps roles
- Exceptional problem-solving under pressure, with a demonstrated track record of diagnosing complex, high-stakes system failures and building durable solutions
- Deep hands-on experience with AWS services including EC2, EKS/ECS, Lambda, RDS, S3, CloudWatch, and related tooling
- Familiarity with infrastructure-as-code (Terraform or CloudFormation), including the ability to contribute to existing configurations
- Experience designing and operating distributed systems with strict availability and latency requirements
- Proficiency in at least one scripting or systems language (Python, Go, Bash, or similar) for automation and tooling
- Experience with container orchestration (Kubernetes, ECS) in production environments
- Expertise in observability tooling (OpenSearch, Prometheus/Grafana, or equivalent)
- Hands-on experience with CI/CD platforms (GitHub Actions, Jenkins, CircleCI, or similar)
- Proven ability to define and operationalize SLOs and error budgets
- Experience with relational and NoSQL databases, including performance tuning, replication, and backup strategies
- Strong working knowledge of networking fundamentals: DNS, load balancing, VPCs, TLS
- Excellent communication skills, including the ability to translate technical risk into business impact for non-engineering stakeholders
Preferred Qualifications:
- AWS Certifications (Solutions Architect, DevOps Engineer, or SysOps Administrator)
- Experience in healthcare technology or other regulated industries (HIPAA, SOC 2, FedRAMP)
- Familiarity with chaos engineering practices and tooling
- Experience with data pipeline reliability (ETL/ELT workflows, streaming systems)
- Exposure to AI/ML infrastructure and the reliability challenges unique to model serving
- Familiarity with additional cloud platforms (Azure, Google Cloud)
- Contributions to open-source reliability or infrastructure tooling