Joblet-AI is seeking a Site Reliability Engineer to keep production systems reliable, observable, and performant. The role combines software engineering with operations to automate processes and improve system reliability.
Responsibilities:
- Design and operate systems for high availability and performance
- Build and maintain observability tooling (logging, metrics, tracing)
- Define and track SLOs, SLIs, and error budgets
- Lead incident response and post-mortem reviews
- Automate operational toil through tooling and platform improvements
- Partner with application teams on production readiness
Requirements:
- 4+ years of experience in SRE, DevOps, or infrastructure engineering
- Strong scripting and software engineering skills (Python, Go, or similar)
- Deep experience with at least one major cloud platform (AWS, GCP, or Azure)
- Hands-on experience with Kubernetes, Terraform, and observability platforms
- Experience leading incident response in production environments
- Strong understanding of distributed systems