Obsidian Security is a company focused on securing SaaS applications for enterprises. The Sr. Staff Site Reliability Engineer will define and drive the company's reliability vision for a multi-tenant SaaS platform, ensuring that system issues are detected and resolved before they impact customers.
Responsibilities:
- Define and lead long-term reliability strategy across services
- Establish end-to-end system visibility frameworks and guide architecture for observability, detection, and resilience
- Partner across teams to embed reliability, standardize SLI/SLOs, and serve as a technical escalation expert
- Build intelligent detection systems (anomaly detection, connector health models) and enable self-service observability
- Define and evolve a tiered incident communication strategy, improve response practices, and lead postmortems to strengthen reliability and customer trust
- Contribute hands-on to system design, monitoring, and debugging across distributed systems and data pipelines
Requirements:
- 5+ years in SRE, Production Engineering, or related roles
- 3+ years operating at a senior or technical leadership level (Staff or equivalent scope)
- Deep expertise in AWS and/or GCP
- Deep expertise in Kubernetes and Helm
- Deep expertise in observability stacks (Prometheus, Grafana, or equivalent)
- Deep expertise in CI/CD systems (GitLab CI/CD, ArgoCD, etc.)
- Proven experience designing and scaling reliability systems for multi-tenant SaaS platforms
- Strong debugging and systems thinking across distributed microservices and legacy systems
- Demonstrated ability to lead initiatives that improve incident detection, response, and system resilience
- Hands-on engineering approach with a track record of building, not just configuring, reliability systems
- Experience in B2B SaaS serving enterprise or financial customers
- Familiarity with third-party SaaS connector architectures and ingestion patterns
- Experience building anomaly detection or intelligent alerting systems
- Experience designing customer-facing status pages and incident communication frameworks