Ditto is redefining how data moves at the edge, making it seamless for developers to build resilient, real-time applications. They are seeking experienced Site Reliability Engineers to ensure their infrastructure delivers enterprise-grade reliability, focusing on observability, system reliability, and operational excellence for their cutting-edge database technology.
Responsibilities:
- Develop and maintain observability solutions using platforms such as Datadog, Prometheus, and Grafana
- Take a leading role in incident management, including coordinating response efforts, troubleshooting issues, and identifying follow-up actions
- Partner with product engineering teams to architect reliable systems, recover from incidents, and learn from mistakes
- Work with teams to implement and maintain SLOs, monitoring, and alerting strategies that ensure reliability at scale
- Design and implement automation and support tooling to improve system resilience, maintain operational safety, and reduce operational overhead
- Lead the development and maintenance of runbooks, alert definitions, and incident response procedures
- Participate in on-call rotations to provide 24/7 support for critical production systems
Requirements:
- 4+ years of experience in Site Reliability Engineering or similar DevOps roles focused on system reliability and incident management
- 2+ years of hands-on experience architecting applications for Kubernetes and managing Kubernetes infrastructure
- Strong experience with modern monitoring stacks including Prometheus, Grafana, and Datadog
- Experience in at least one systems programming language, such as Go, Rust, C, or Java
- Expertise with Infrastructure as Code tools such as Terraform and Helm
- Expertise with at least one major cloud service provider (AWS, GCP, Azure)
- Strong communication skills, with the ability to lead incident response and effectively collaborate across teams
- Experience with, and willingness to participate in, on-call rotations and emergency response procedures
- A high degree of agency and a bias towards action, with the ability to identify problems and work autonomously to solve them
- Excellent problem-solving skills and a methodical approach to troubleshooting complex issues
- Experience building multi-tenant, multi-cloud SaaS/DBaaS platforms
- Knowledge of edge computing or mesh networking
- Experience implementing advanced observability practices (tracing, profiling) in distributed systems
- Experience working with globally distributed teams
- Proven experience in project management