Participate in on-call and incident response:
Respond to production incidents, contribute to service restoration, and support clear communication during incidents.
Improve operational reliability:
Identify recurring issues and reliability risks, and drive fixes through better alerting, automation, system changes, or process improvements.
Own parts of the production environment:
Operate and improve Kubernetes clusters, cloud infrastructure, and core platform services.
Strengthen observability:
Improve dashboards, alerts, logs, and traces so issues are detected earlier.
Reduce operational toil:
Automate repetitive tasks, simplify runbooks, and improve tooling for day-to-day operations.
Support safe change:
Improve deployments, rollback mechanisms, and operational readiness.
Contribute to operational practices:
Write and maintain runbooks, and participate in blameless post-mortems.
Collaborate closely with engineers:
Work with product and feature teams to improve production readiness.
Requirements
3–6+ years of experience in SRE, DevOps, Platform, or operations-heavy engineering roles.
Experience supporting production systems and participating in on-call rotations.
Comfortable debugging live systems under pressure.