Own uptime, reliability, and performance of services running on AWS + Kubernetes (EKS).
Design and implement self-healing infrastructure using automation and AI agents.
Build LLM-powered operational tooling using APIs such as the OpenAI API for:
Intelligent alert triage
Incident summarization
Root cause analysis
Runbook automation
Manage and scale Kubernetes workloads:
Deployments, autoscaling, resource optimization
Cluster reliability and cost efficiency
Build and evolve observability systems:
Metrics (Prometheus), dashboards (Grafana)
Logs (ELK / OpenSearch)
Tracing (OpenTelemetry)
Define and enforce SLOs, SLAs, and error budgets tied to business metrics.
Automate infrastructure using Terraform and CI/CD pipelines.
Lead incident response, postmortems, and continuous reliability improvements.
Introduce chaos engineering practices to proactively test system resilience.
Requirements
8+ years of experience in SRE, DevOps, or Platform Engineering.
Strong hands-on experience with:
AWS infrastructure at scale
Kubernetes (production-grade clusters)
Proven ability to debug complex distributed systems under pressure.
Strong coding skills (Python or Go); you will build internal platforms and tools.
Experience implementing monitoring, alerting, and incident management systems.
Bonus (AI / LLM Focus):
Experience working with LLM APIs such as the OpenAI API.
Familiarity with agent frameworks like:
LangChain
AutoGen
Experience building or experimenting with:
AI agents for DevOps / SRE workflows
Retrieval-Augmented Generation (RAG) systems
Vector databases (Pinecone, Weaviate, etc.)
Exposure to AIOps or intelligent automation systems.
Tech Stack
AWS
Distributed Systems
Grafana
Kubernetes
Prometheus
Python
Terraform
Go
Benefits
Saviynt is an equal opportunity employer and we welcome everyone to our team.
All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or veteran status.
Staff Site Reliability Engineer at Saviynt