Own the uptime, reliability, and performance of services running on AWS and Kubernetes (EKS).
Design and implement self-healing infrastructure using automation and AI agents.
Build LLM-powered operational tooling using APIs such as the OpenAI API for intelligent alert triage, incident summarization, root cause analysis, and runbook automation.
Manage and scale Kubernetes workloads: deployments, autoscaling, resource optimization, cluster reliability, and cost efficiency.
Build and evolve observability systems: metrics (Prometheus), dashboards (Grafana), logs (ELK / OpenSearch), and tracing (OpenTelemetry).
Define and enforce SLOs, SLAs, and error budgets tied to business metrics.
Automate infrastructure using Terraform and CI/CD pipelines.
Lead incident response, postmortems, and continuous reliability improvements.
Introduce chaos engineering practices to proactively test system resilience.

Saviynt is an equal opportunity employer and we welcome everyone to our team.
All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or veteran status.

Staff Site Reliability Engineer

Key skills