Participate in on-call and incident response: Respond to production incidents, contribute to service restoration, and support clear communication during incidents.
Improve operational reliability: Identify recurring issues and reliability risks, and drive fixes through better alerting, automation, system changes, or process improvements.
Own parts of the production environment: Operate and improve Kubernetes clusters, cloud infrastructure, and core platform services.
Strengthen observability: Improve dashboards, alerts, logs, and traces so issues are detected earlier.
Reduce operational toil: Automate repetitive tasks, simplify runbooks, and improve tooling for day-to-day operations.
Support safe change: Improve deployments, rollback mechanisms, and operational readiness.
Contribute to operational practices: Write and maintain runbooks, and participate in blameless post-mortems.
Collaborate closely with engineers: Work with product and feature teams to improve production readiness.

3–6+ years of experience in SRE, DevOps, Platform, or operations-heavy engineering roles.
Experience supporting production systems and participating in on-call rotations.
Comfortable debugging live systems under pressure.
Experience operating cloud infrastructure (AWS preferred).
Working knowledge of Kubernetes and containerised workloads.
Infrastructure as Code experience (Terraform or similar).
Familiarity with monitoring and alerting tools (Datadog, Prometheus, etc.).
Scripting or automation experience (Python, Bash, or similar).

Site Reliability Engineer

Key skills