Cutover values inclusivity and empathy in the workplace. The company is seeking a Site Reliability Engineer to ensure the reliability and performance of its production systems, working closely with support and engineering teams to keep the platform dependable.
Responsibilities:
- Incident Response: Respond to incidents and alerts, triaging by urgency and investigating root causes
- Documentation: Contribute regularly to our documentation on system design, troubleshooting, best practices, and engineering processes
- Root Cause Analysis: Contribute to post-mortems and help identify long-term improvements under guidance
- Collaboration: Support cross-functional teams during investigations and post-incident reviews
- Observability: Support and enhance observability tools and techniques by identifying metrics, logging, and alerting improvements
- Automation: Write and execute simple automation scripts (e.g. Python, Ruby, Bash) to improve reliability and reduce toil
- Development: Work on internal tools, pipelines, and IaC solutions to help improve the speed of software delivery and recovery
- System Reliability: Work on efforts to enhance the reliability and performance of our application and systems, ensuring optimal uptime and minimal disruptions
- Infrastructure Optimization: Work closely with the development and platform engineering teams to optimize the infrastructure on AWS, ensuring scalability and efficiency
Requirements:
- A genuine excitement for complex problem solving within our tech stack, applying what you know to our unique problems
- Familiarity with at least one scripting language such as Ruby, JavaScript, Python, or Bash
- Experience with containerization (e.g. Docker) or IaC (e.g. Terraform, Helm, CloudFormation)
- An eagerness to follow modern engineering practices and learn from others
- Familiarity with observability tools such as DataDog, New Relic, Grafana, Prometheus, ELK, or OpenTelemetry
- Understanding of core networking concepts (DNS, HTTP/S, Load Balancing, etc.)
- A collaborative mindset with clear communication skills
- A willingness to ask questions to better understand new or complex concepts
- Exposure to major incident response processes
- AWS Certified Cloud Practitioner or hands-on experience with cloud environments