Brooksource is seeking a Senior Site Reliability Engineer for its Fortune 14 healthcare client. The role provides technical leadership in ensuring uptime, scalability, and incident resilience across enterprise platforms, and collaborates with engineering and infrastructure teams to solve reliability challenges.
Responsibilities:
- Collaborate across engineering, development, and infrastructure teams to solve complex reliability challenges using automation and observability
- Maintain and evolve cloud‑native Sonexus platforms (Azure preferred; GCP optional), optimizing availability, latency, and performance
- Lead and participate in high‑impact incident response (P1/P2), using playbooks, logs, and anomaly detection tools such as Splunk and Dynatrace
- Develop and refine alerting strategies based on SLIs, SLOs, and error budgets to reduce noise and drive actionable response
- Build and maintain dashboards to monitor latency, error rates, throughput, and overall system health
- Automate operational runbooks and response workflows using Python and infrastructure‑as‑code tools (Terraform preferred)
- Conduct root cause analysis and lead blameless postmortems with action tracking via Azure DevOps
- Drive observability best practices within application code, enabling feature telemetry and end‑to‑end distributed tracing
- Define, measure, and continuously refine SLIs and SLOs based on system behavior and user experience
- Support platform risk management, governance, and compliance activities
- Partner with DevOps and engineering teams to support SRE‑aligned change management and release processes across CI/CD pipelines
- Assist with capacity planning, reliability forecasting, and long‑term system resilience strategy
- Lead and mentor a small team of SREs, providing technical guidance and support as needed
- Perform additional duties as assigned
Requirements:
- 5+ years of experience in Site Reliability Engineering, DevOps, or platform operations supporting cloud‑native, enterprise applications
- Proven experience leading or mentoring SREs in a production environment
- Hands‑on experience with observability platforms, including Dynatrace, Splunk, and Azure Monitor
- Strong background in web application debugging and performance optimization (HTML, JavaScript, Angular, React, .NET)
- Proficiency with automation and infrastructure tooling such as Terraform and Python scripting
- Experience designing and operating SLO‑based alerting and reliability strategies, and working with load or performance testing tools
- Familiarity with OpenTelemetry, Prometheus, and distributed tracing concepts
- Experience supporting or integrating API management platforms (e.g., Azure API Management, Apigee)
- Ability to function as a hands‑on engineer while also providing technical leadership and direction
- Experience working in healthcare or regulated environments
- Familiarity with compliance frameworks impacting healthcare platforms
- Exposure to AI‑assisted or predictive operations initiatives