Practice by Numbers is seeking a Senior Site Reliability Engineer (SRE) to enhance the reliability of its mission-critical healthcare infrastructure. The role involves owning reliability outcomes, designing scalable systems, and mentoring engineers on reliability practices.
Responsibilities:
- Own reliability outcomes for critical services: availability, latency, incident rate, and recovery time
- Design and build reliable, scalable distributed systems that support mission-critical healthcare workflows
- Define and operationalize SLOs/SLIs and error budgets; drive adoption across teams and use them to prioritize work
- Lead incident response for high-severity issues; improve on-call effectiveness and reduce alert fatigue
- Run blameless postmortems and ensure follow-ups are implemented, measured, and sustained
- Write software to eliminate operational toil: automation, self-service tooling, guardrails, and developer platforms
- Raise the bar on observability (metrics/logs/traces), alerting strategy, and operational readiness
- Improve resilience through capacity planning, load testing, performance tuning, and failure testing
- Mentor engineers (SRE and product engineers) on reliability practices, debugging, and production ownership
- Drive cross-team improvements such as production readiness reviews, release safety (progressive delivery), and standard runbooks
Requirements:
- Engineering degree is mandatory: BS/MS in Computer Science, Computer Engineering, Electrical Engineering, or a closely related engineering field
- 6+ years of experience in software engineering, SRE, infrastructure/platform engineering, or a related field
- Strong programming skills in Go, Python, Java, or similar (production-quality code)
- Proven experience building and operating production backend services or distributed systems
- Meaningful experience in on-call rotations, incident leadership, and post-incident improvement execution
- Strong debugging ability across complex systems: latency, saturation, cascading failures, dependency issues
- Experience with cloud infrastructure (AWS preferred, GCP/Azure acceptable)
- You've owned reliability for customer-facing services with clear, measurable improvements (e.g., higher availability, lower MTTR)
- You've built internal platforms/tooling that made other engineers faster and reduced operational burden
- You've worked in an SRE culture with SLOs, error budgets, and blameless postmortems
- You've led multi-quarter reliability initiatives spanning multiple teams/services