Own and improve on-call processes, incident response playbooks, and post-mortem culture
Define, track, and manage SLOs, SLIs, and error budgets for critical services
Lead blameless post-mortems and drive systematic reliability improvements
Respond to production incidents and coordinate cross-functional resolution
Design, build, and maintain scalable AWS infrastructure using infrastructure as code (Terraform, Pulumi)
Manage Kubernetes clusters and containerized workloads in production
Build and maintain CI/CD pipelines to improve deployment speed and reliability
Evaluate and implement tooling to enhance developer productivity and system stability
Implement monitoring, alerting, and distributed tracing (Prometheus, Grafana, Datadog, Jaeger)
Identify and resolve performance bottlenecks across services, networks, and databases
Build dashboards and runbooks for self-service operational insights
Partner with engineering teams to embed reliability practices (load testing, capacity planning, chaos engineering)
Conduct architecture reviews with a focus on reliability and operability

A high-impact role at a growing SaaS company that values personal growth, accountability, and teamwork
A culture of open collaboration and problem-solving
100% remote
Competitive pay

Senior Site Reliability Engineer

Key skills