Design, build, and operate reliable and scalable systems by defining and monitoring SLOs/SLIs, working directly on production infrastructure, and collaborating closely with software engineers on system design and reliability improvements
Actively develop automation for infrastructure and operational workflows to eliminate toil and reduce MTTR, participate in and lead incident response, and drive blameless post-incident reviews with concrete follow-ups implemented in code and tooling
Continuously analyze and optimize system performance and cost, provide data, insights, and recommendations to inform capacity planning, and support security best practices through hands-on vulnerability remediation and threat mitigation
Requirements
SRE & Cloud Engineering: Hands-on experience with SRE practices in production, strong AWS expertise, Kubernetes, networking, DNS, and Infrastructure as Code (Pulumi preferred, Terraform a plus)
Automation & Software Engineering: Strong software engineering fundamentals, including solid Python proficiency, deep knowledge of the Python ecosystem (testing, debugging, packaging), and a consistent focus on writing clean, well-structured, maintainable code
Reliability, Data & Operations: Lead incident response and root cause analyses, improve system reliability, and engage stakeholders to propose solutions, share learnings, and mentor others
Nice-to-Have: Experience operating in highly regulated industries (e.g. Insurance, Banking, Healthcare), managing sensitive data, and supporting secure networking setups, including exposure to security technologies such as Cloudflare.
Strong understanding of microservices architectures, their principles and trade-offs, with the ability to troubleshoot and maintain distributed systems and supporting technologies (RabbitMQ, Kafka, PostgreSQL, Redis).
Hands-on experience with Datadog for platform and application monitoring and performance optimization, plus solid fundamentals in database structures and operational troubleshooting.
Hands-on experience with PySpark and familiarity with MLOps practices, including model registries, versioning, retraining workflows, and deployment lifecycles.
Tech Stack
AWS
Cloud
Distributed Systems
DNS
Kafka
Kubernetes
Microservices
Postgres
PySpark
Python
RabbitMQ
Redis
Terraform
Benefits
Work Your Way: Enjoy full flexibility – work from home, the office or a mix of both. Plus, work from anywhere for up to 30 days a year.
Grow With Us: Get access to learning resources, mentorship and a growth plan tailored to you.
Thrive and Perform: Enjoy private healthcare, gym discounts, wellbeing programs and mental health support.