Cutover values inclusivity and empathy in the workplace. The company is seeking a Site Reliability Engineer to keep its production systems reliable and performant, collaborating closely with support and engineering teams to improve the platform's reliability.

Responsibilities:

Incident Response: Respond to incidents and alerts, triaging by urgency and investigating root causes
Documentation: Contribute regularly to our documentation on system design, troubleshooting, best practices, and engineering processes
Root Cause Analysis: Contribute to post-mortems and help identify long-term improvements under guidance
Collaboration: Support cross-functional teams during investigations and post-incident reviews
Observability: Support and enhance observability tools and techniques by identifying metrics, logging, and alerting improvements
Automation: Write and run simple automation scripts (e.g. Python, Ruby, Bash) to improve reliability and reduce toil
Development: Work on internal tools, pipelines, and IaC solutions to help improve the speed of software delivery and recovery
System Reliability: Help enhance the reliability and performance of our application and systems, maximizing uptime and minimizing disruptions
Infrastructure Optimization: Work closely with the development and platform engineering teams to optimize the infrastructure on AWS, ensuring scalability and efficiency

Requirements:

A genuine excitement for complex problem solving within our tech stack, applying what you know to our unique problems
Familiarity with at least one scripting language such as Ruby, JavaScript, Python, Bash
Experience with containerization (e.g. Docker) or IaC (e.g. Terraform, Helm, CloudFormation)
An eagerness to follow modern engineering practices and learn from others
Familiarity with observability tools such as DataDog, New Relic, Grafana, Prometheus, ELK, or OpenTelemetry
Understanding of core networking concepts (DNS, HTTP/S, Load Balancing, etc.)
A collaborative mindset with clear communication skills
A willingness to ask questions to gain a better understanding of new or complex concepts
Exposure to major incident response processes
AWS Certified Cloud Practitioner or hands-on experience with cloud environments
