Hands-on Reliability & System Engineering: Design, build, and operate reliable and scalable systems by defining and monitoring SLOs/SLIs, working directly on production infrastructure, and collaborating closely with software engineers on system design and reliability improvements
Automation, Operations & Incident Response: Actively develop automation for infrastructure and operational workflows to eliminate toil and reduce MTTR, participate in and lead incident response, and drive blameless post-incident reviews with concrete follow-ups implemented in code and tooling
Performance, Capacity & Security: Continuously analyze and optimize system performance and cost, provide data, insights, and recommendations to inform capacity planning, and support security best practices through hands-on vulnerability remediation and threat mitigation
Requirements
SRE & Cloud Engineering: Hands-on experience with SRE practices in production and strong AWS expertise, plus Kubernetes, networking, DNS, and Infrastructure as Code (Pulumi preferred, Terraform a plus)
Automation & Software Engineering: Strong software engineering fundamentals with an emphasis on code quality and maintainability, including solid Python proficiency and deep knowledge of the Python ecosystem
Reliability, Data & Operations: Lead incident response and RCAs, improve system reliability, and engage stakeholders to propose solutions, share learnings, and mentor others
Tech Stack
AWS
Cloud
DNS
Kubernetes
Python
Terraform
Benefits
Work Your Way: Enjoy full flexibility – work from home, the office or a mix of both. Plus, work from anywhere for up to 30 days a year.
Grow with us: Get access to learning resources, mentorship and a growth plan tailored to you.
Thrive and perform: Enjoy private healthcare, gym discounts, wellbeing programs and mental health support.