Obsidian Security is a company focused on securing SaaS applications for modern businesses. The Staff Site Reliability Engineer will lead the reliability strategy for a complex multi-tenant SaaS platform, ensuring proactive detection of system failures and improving incident response processes.
Responsibilities:
- Map and instrument critical system paths for top-tier enterprise customers
- Build connector health models to classify issues as internal defects (“our bug”), upstream SaaS outages, or expected sparse/low-signal scenarios
- Establish tiered incident communication: a public status page for all customers and direct outreach for high-priority accounts
- Define and begin rollout of SLI/SLO standards across microservices
- Develop self-service instrumentation tooling enabling engineering teams to own observability
- Implement baseline-aware anomaly detection across all connectors (beyond static thresholds)
- Mature incident response processes, including structured post-mortems and continuous reliability improvements
Requirements:
- 7+ years in SRE, production engineering, or similar roles
- 2+ years operating as a technical lead
- Deep expertise with:
  - AWS and/or GCP
  - Kubernetes, Helm
  - Observability stack (Prometheus, Grafana)
  - CI/CD systems (GitLab CI/CD, ArgoCD)
- Proven experience building monitoring for multi-tenant SaaS systems with complex data pipelines
- Strong debugging skills across distributed microservices and legacy systems
- Hands-on engineering mindset: able to instrument services directly, not just configure tooling
- Track record of building or significantly improving incident detection and response systems
- Experience in B2B SaaS serving enterprise or financial customers
- Familiarity with third-party SaaS connector ingestion patterns
- Experience building anomaly detection systems or baseline-aware alerting
- Experience implementing customer-facing status pages and incident communication frameworks