Oracle Public Safety is delivering a next‑generation SaaS platform that empowers first responders with resilient, secure, and highly available software. The team is seeking an experienced Principal Site Reliability Engineer to help build, operate, and optimize its production platforms.
Responsibilities:
- Define and own service-level objectives (SLOs), SLIs, and error budgets; drive reliability roadmaps with engineering and product
- Design for resilience, high availability, and disaster recovery across regions and tenants; conduct capacity planning and load testing
- Proactively identify and remediate reliability, latency, and scalability bottlenecks
- Build and operate production infrastructure and shared platform services on Oracle Cloud Infrastructure (OCI)
- Develop infrastructure as code (e.g., Terraform/Ansible) and automate provisioning, configuration, and compliance
- Evolve CI/CD pipelines to enable safe, frequent, and reversible deployments (progressive delivery, canary, and automated rollbacks)
- Implement and mature end‑to‑end observability (metrics, logs, traces, profiling, real user monitoring) with actionable alerting and SLO‑based paging
- Support incident response, post‑incident reviews, and problem management; convert findings into backlog items and architectural changes
- Create runbooks, readiness checks, and game days; drive chaos testing and failure injection where appropriate
- Embed security controls in the SDLC and platform (secrets management, image scanning, vulnerability management, policy as code)
- Partner with security and compliance teams to meet enterprise and public safety requirements; support audits and evidence gathering
- Ensure least-privilege access, network segmentation, and data protection across environments
- Collaborate with software engineers on production‑readiness reviews, capacity/scalability patterns, and cost optimization
- Provide guidance on operability, architecture, and performance for microservices, data pipelines, and real‑time event processing
- Mentor teammates; contribute to standards, documentation, and knowledge sharing
Requirements:
- 6–10 years of hands-on experience in Site Reliability Engineering, Production Engineering, or closely related software/systems roles
- Strong Linux/Unix fundamentals (Oracle Linux preferred) and systems performance tuning
- Proficiency operating services on OCI (preferred) or another major cloud; solid understanding of networking, virtual networks (VCNs/VPCs), IAM, and security groups
- Containers and orchestration expertise: Docker and Kubernetes (including Helm, operators, and multi‑cluster strategies)
- CI/CD experience (Jenkins or GitLab CI) with progressive delivery patterns, quality gates, and environment promotions
- Java experience is required, including debugging, performance tuning, and operability of Java-based microservices in production
- Scripting and automation in Bash
- Infrastructure as Code and automation: Terraform, Ansible
- Datastores: Oracle Database and MySQL, including performance tuning, HA, and backup/restore; familiarity with MS SQL and/or NoSQL is a plus
- Observability: hands-on experience with metrics, logs, and traces (e.g., Prometheus, Grafana, OCI Monitoring/Logging, OpenTelemetry); alert design and runbooks
- Version control and collaboration: Git (Bitbucket preferred); issue tracking and documentation (Jira, Confluence)
- Experience with ITIL practices (Incident, Problem, Change; Foundation certification preferred) and Agile delivery frameworks
- Familiarity with web and microservices architectures, REST/GraphQL, API gateways, and edge/CDN patterns
- A systems thinker with excellent communication skills; able to move from strategy to detailed implementation and influence across teams
- Self‑starter; comfortable owning complex production systems and driving cross‑functional reliability initiatives
Preferred qualifications:
- Experience with service mesh (e.g., Istio), policy as code (OPA), and secrets management (Vault/OCI Vault)
- Chaos engineering, reliability testing frameworks, or game day facilitation
- Cost management/FinOps in multi‑tenant SaaS
- Experience supporting AI/ML or real‑time event/data processing platforms in production