Oracle Health is seeking a Senior Site Reliability Engineer to build a modern, automated healthcare platform that millions rely on. The role involves designing, automating, and operating secure, highly available cloud services to drive reliability, speed, and efficiency across the platform.
Responsibilities:
- Own service reliability end-to-end: architecture, production operations, and on-call excellence
- Build automation and self-healing systems using IaC (e.g., Terraform) and CI/CD
- Design, implement, and evolve observability (metrics, tracing, logging), SLOs, and error budgets
- Lead capacity planning, performance tuning, and cost/sustainability initiatives
- Develop tooling and services to improve scalability, availability, and developer productivity
- Partner with cross-functional teams to deliver features safely (canary/blue-green, progressive delivery)
- Drive incident response, root-cause analysis, and prevention through automation
- Prototype and standardize platform services and best practices across teams
Requirements:
- US citizenship and the ability to obtain/maintain a federal security clearance
- Experience operating large-scale, distributed, fault-tolerant systems in production
- Strong scripting/programming skills (Python, Bash; Java/C++ a plus)
- Infrastructure as Code and automation (Terraform; Ansible/Chef/Puppet/Packer a plus)
- CI/CD pipelines and tooling (Git, GitLab/Jenkins/Rundeck)
- Cloud experience (OCI, AWS, Azure, or similar)
- Deep knowledge of monitoring, alerting, incident management, and postmortems
- Solid grasp of networking, security fundamentals, and performance engineering
- Experience in regulated or high-compliance environments
- Experience with data/analytics and platform sustainability optimization
- Containers and orchestration (Kubernetes, Docker)