Oracle Public Safety is delivering a next‑generation SaaS platform that gives first responders resilient, secure, and highly available software. The team is seeking an experienced Principal Site Reliability Engineer to help build, operate, and optimize its production platforms.

Responsibilities:

Define and own service-level objectives (SLOs), service-level indicators (SLIs), and error budgets; drive reliability roadmaps with engineering and product
Design for resilience, high availability, and disaster recovery across regions and tenants; conduct capacity planning and load testing
Proactively identify and remediate reliability, latency, and scalability bottlenecks
Build and operate production infrastructure and shared platform services on Oracle Cloud Infrastructure (OCI)
Develop infrastructure as code (e.g., Terraform/Ansible) and automate provisioning, configuration, and compliance
Evolve CI/CD pipelines to enable safe, frequent, and reversible deployments (progressive delivery, canary, and automated rollbacks)
Implement and mature end‑to‑end observability (metrics, logs, traces, profiling, RUM) with actionable alerting and SLO‑based paging
Support incident response, post‑incident reviews, and problem management; convert findings into backlog items and architectural changes
Create runbooks, readiness checks, and game days; drive chaos testing and failure injection where appropriate
Embed security controls in the SDLC and platform (secrets management, image scanning, vulnerability management, policy as code)
Partner with security and compliance teams to meet enterprise and public safety requirements; support audits and evidence gathering
Ensure least-privilege access, network segmentation, and data protection across environments
Collaborate with software engineers on production‑readiness reviews, capacity/scalability patterns, and cost optimization
Provide guidance on operability, architecture, and performance for microservices, data pipelines, and real‑time event processing
Mentor teammates; contribute to standards, documentation, and knowledge sharing

Requirements:

6–10 years of hands-on experience in Site Reliability Engineering, Production Engineering, or closely related software/systems roles
Strong Linux/Unix fundamentals (Oracle Linux preferred) and systems performance tuning
Proficiency operating services on OCI (preferred) or another major cloud; solid understanding of networking, virtual cloud networks (VCNs/VPCs), IAM, and security groups
Containers and orchestration expertise: Docker and Kubernetes (including Helm, operators, and multi‑cluster strategies)
CI/CD experience (Jenkins or GitLab CI) with progressive delivery patterns, quality gates, and environment promotions
Java experience is required, including debugging, performance tuning, and operability of Java-based microservices in production
Scripting and automation in Bash
Infrastructure as Code and automation: Terraform, Ansible
Datastores: Oracle Database, MySQL; familiarity with MS SQL and/or NoSQL is a plus; experience with performance, HA, and backup/restore
Observability: hands-on with metrics/logs/traces (e.g., Prometheus, Grafana, OCI Monitoring/Logging, OpenTelemetry); alert design and runbooks
Version control and collaboration: Git (Bitbucket preferred); issue tracking and documentation (Jira, Confluence)
Experience with ITIL practices (Incident, Problem, Change; Foundation certification preferred) and Agile delivery frameworks
Familiarity with web and microservices architectures, REST/GraphQL, API gateways, and edge/CDN patterns
A systems thinker with excellent communication skills; able to move from strategy to detailed implementation and influence across teams
Self‑starter; comfortable owning complex production systems and driving cross‑functional reliability initiatives
Experience with service mesh (e.g., Istio), policy as code (OPA), and secrets management (Vault/OCI Vault)
Chaos engineering, reliability testing frameworks, or game day facilitation
Cost management/FinOps in multi‑tenant SaaS
Experience supporting AI/ML or real‑time event/data processing platforms in production

Principal Site Reliability Engineer - Public Safety
