Brooksource is seeking a Senior Site Reliability Engineer for its Fortune 14 healthcare client. The role provides technical leadership for uptime, scalability, and incident resilience across enterprise platforms, working with engineering and infrastructure teams to solve reliability challenges.

Responsibilities:

Collaborate across engineering, development, and infrastructure teams to solve complex reliability challenges using automation and observability
Maintain and evolve cloud‑native Sonexus platforms (Azure preferred; GCP optional), optimizing availability, latency, and performance
Lead and participate in high‑impact incident response (P1/P2), using playbooks, logs, and anomaly detection tools such as Splunk and Dynatrace
Develop and refine alerting strategies based on SLIs, SLOs, and error budgets to reduce noise and drive actionable response
Build and maintain dashboards to monitor latency, error rates, throughput, and overall system health
Automate operational runbooks and response workflows using Python and infrastructure‑as‑code tools (Terraform preferred)
Conduct root cause analysis and lead blameless postmortems with action tracking via Azure DevOps
Drive observability best practices within application code, enabling feature telemetry and end‑to‑end distributed tracing
Define, measure, and continuously refine SLIs and SLOs based on system behavior and user experience
Support platform risk management, governance, and compliance activities
Partner with DevOps and engineering teams to support SRE‑aligned change management and release processes across CI/CD pipelines
Assist with capacity planning, reliability forecasting, and long‑term system resilience strategy
Lead and mentor a small team of SREs, providing technical guidance and support as needed
Perform additional duties as assigned

Requirements:

5+ years of experience in Site Reliability Engineering, DevOps, or platform operations supporting cloud‑native, enterprise applications
Proven experience leading or mentoring SREs in a production environment
Hands‑on experience with observability platforms, including Dynatrace, Splunk, and Azure Monitor
Strong background in web application debugging and performance optimization (HTML, JavaScript, Angular, React, .NET)
Proficiency with automation and infrastructure tooling such as Terraform and Python scripting
Experience designing and operating alerting and reliability strategies using SLO metrics and load or performance testing tools
Familiarity with OpenTelemetry, Prometheus, and distributed tracing concepts
Experience supporting or integrating API management platforms (e.g., Azure API Management, Apigee)
Ability to function as a hands‑on engineer while also providing technical leadership and direction
Experience working in healthcare or regulated environments
Familiarity with compliance frameworks impacting healthcare platforms
Exposure to AI‑assisted or predictive operations initiatives
