Develop and maintain tools that maximise engineering efficiency, including but not limited to automating deployment infrastructure and database upgrades
Seek out processes that can be improved with automation, with internal developer experience as the main driver
Collaborate with engineers and enable them to do their jobs more efficiently
Create, maintain and test our system disaster recovery process, including tooling to automate it
Handle production incidents, author blameless postmortems and enrich operational playbooks and runbooks
Run performance investigations and drive tuning across apps, data stores and AWS
Own cost optimisation workstreams: right-sizing, autoscaling policies, workload scheduling, storage tiering, and identifying waste across ECS/Fargate, RDS/Aurora, SQS and observability
Advocate for the GitOps methodology

A software development background, with experience shipping and operating production services
Experience working across the AWS ecosystem
A curiosity about AI and how it’s reshaping software development
Collaborative, security-minded, and detail-oriented
Knowledge of platform and ops concepts such as networking and Linux administration
Experience working with microservices and distributed systems at scale
Experience with monitoring tools: we use OpenTelemetry, Honeycomb, Grafana, Pingdom and Incident.io

Site Reliability Engineer

Key skills