Operate, scale, and continuously improve our production Kubernetes clusters on Google Cloud Platform (GCP)
Manage and provision cloud infrastructure using Infrastructure as Code (Terraform)
Maintain and optimise critical messaging and event-streaming systems (RabbitMQ, Kafka)
Manage edge networking, traffic routing, and security using Cloudflare
Improve CI/CD pipelines to enable safe, fast, and reliable deployments
Partner with development teams to optimise Java services (JVM tuning, connection pooling, container resource allocation)
Manage and troubleshoot logging and observability tools (e.g. Elasticsearch, Kibana)
Support and advise on high-availability data stores such as MySQL and ClickHouse
Lead incident response as Incident Commander during major production events
Coordinate cross-functional teams and communicate effectively with both technical and non-technical stakeholders
Conduct blameless postmortems and drive improvements to prevent recurring issues
Design and execute load testing strategies to validate system performance under peak conditions
Define and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
Improve monitoring and alerting using Prometheus and Grafana, reducing alert noise and mean time to recovery (MTTR)

3+ years of experience in Site Reliability Engineering, DevOps, or similar roles
Strong hands-on experience with Kubernetes in production environments
Solid experience with GCP (or another major cloud provider)
Proficiency with Infrastructure as Code tools (Terraform preferred)
Experience with messaging systems or event streaming platforms (RabbitMQ, Kafka)
Strong troubleshooting skills across infrastructure, networking, and application layers
Experience handling production incidents and conducting postmortems
Scripting and automation skills (e.g. Bash, Python)
Strong Linux systems knowledge

Private health insurance
Wellness incentives, including a fitness allowance and mental well-being services
Flexible national holidays: public holidays mean more time off, and you choose how and when to enjoy them!
2 weeks (10 days) of Work From Anywhere, increasing to 4 weeks (20 days) with longer tenure at the company: explore the world while working remotely
Gourmet lunches and healthy snacks prepared by our in-house chef
A variety of discounts from local vendors
Access to some of the greatest tools and platforms for developing your professional skills and building success within your role
A range of training courses, known as Casumo College, for continuous learning and growth
Social events for building strong relationships with colleagues from all across the organisation

SRE Engineer

Key skills