Design, build, and operate reliable and scalable systems by defining and monitoring SLOs/SLIs, working directly on production infrastructure, and collaborating closely with software engineers on system design and reliability improvements
Actively develop automation for infrastructure and operational workflows to eliminate toil and reduce MTTR, participate in and lead incident response, and drive blameless post-incident reviews with concrete follow-ups implemented in code and tooling
Continuously analyze and optimize system performance and cost, provide data, insights, and recommendations to inform capacity planning, and support security best practices through hands-on vulnerability remediation and threat mitigation
Requirements
SRE & Cloud Engineering: Hands-on experience with SRE practices in production, strong AWS expertise, Kubernetes, networking, DNS, and Infrastructure as Code (Pulumi preferred, Terraform a plus)
Automation & Software Engineering: Strong software engineering fundamentals, including solid Python proficiency, deep knowledge of the Python ecosystem (testing, debugging, packaging), and a consistent focus on writing clean, well-structured, maintainable code
Reliability, Data & Operations: Lead incident response and root cause analyses, improve system reliability, and engage stakeholders to propose solutions, share learnings, and mentor others
Nice-to-Have: Experience operating in highly regulated industries (e.g. Insurance, Banking, Healthcare), managing sensitive data, and supporting secure networking setups, including exposure to security technologies such as Cloudflare.
Strong understanding of microservices architectures, their principles and trade-offs, with the ability to troubleshoot and maintain distributed systems and supporting technologies (RabbitMQ, Kafka, PostgreSQL, Redis).
Hands-on experience with Datadog for platform and application monitoring and performance optimization, plus solid fundamentals in database structures and operational troubleshooting.
Hands-on experience with PySpark and familiarity with MLOps practices, including model registries, versioning, retraining workflows, and deployment lifecycles.
Tech Stack
AWS
Cloud
Distributed Systems
DNS
Kafka
Kubernetes
Microservices
Postgres
PySpark
Python
RabbitMQ
Redis
Terraform
Benefits
Work Your Way: Enjoy full flexibility – work from home, the office or a mix of both. Plus, work from anywhere for up to 30 days a year.
Grow With Us: Get access to learning resources, mentorship and a growth plan tailored to you.
Thrive and Perform: Enjoy private healthcare, gym discounts, wellbeing programs and mental health support.