Practice by Numbers is seeking a Senior Site Reliability Engineer (SRE) to enhance the reliability of its mission-critical healthcare infrastructure. The role involves owning reliability outcomes, designing scalable systems, and mentoring engineers on reliability practices.

Responsibilities:

Own reliability outcomes for critical services: availability, latency, incident rate, and recovery time
Design and build reliable, scalable distributed systems that support mission-critical healthcare workflows
Define and operationalize SLOs/SLIs and error budgets; drive adoption across teams and use them to prioritize work
Lead incident response for high-severity issues; improve on-call effectiveness and reduce alert fatigue
Run blameless postmortems and ensure follow-ups are implemented and measured, and that the improvements stick
Write software to eliminate operational toil: automation, self-service tooling, guardrails, and developer platforms
Raise the bar on observability (metrics/logs/traces), alerting strategy, and operational readiness
Improve resilience through capacity planning, load testing, performance tuning, and failure testing
Mentor engineers (SRE and product engineers) on reliability practices, debugging, and production ownership
Drive cross-team improvements like production readiness reviews, release safety (progressive delivery), and standard runbooks

Requirements:

Engineering degree is mandatory: BS/MS in Computer Science, Computer Engineering, Electrical Engineering, or a closely related engineering field
6+ years of experience in software engineering, SRE, infrastructure/platform engineering, or a related field
Strong programming skills in Go, Python, Java, or similar (production-quality code)
Proven experience building and operating production backend services or distributed systems
Meaningful experience in on-call rotations, incident leadership, and post-incident improvement execution
Strong debugging ability across complex systems: latency, saturation, cascading failures, dependency issues
Experience with cloud infrastructure (AWS preferred, GCP/Azure acceptable)
You've owned reliability for customer-facing services with clear, measurable improvements (e.g., higher availability, lower MTTR)
You've built internal platforms/tooling that made other engineers faster and reduced operational burden
You've worked in an SRE culture with SLOs, error budgets, and blameless postmortems
You've led multi-quarter reliability initiatives spanning multiple teams/services
