ClickHouse is a fast-growing private cloud company recognized on the 2025 Forbes Cloud 100 list, specializing in real-time analytics and data warehousing. They are seeking a Senior Site Reliability Engineer to enhance the reliability and performance of their cloud infrastructure by collaborating with various engineering teams and managing incident response processes.
Responsibilities:
- Collaborate with engineering teams across ClickHouse to design and implement scalable, secure, and highly available systems
- Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud
- Ensure all infrastructure components in ClickHouse Cloud (including Data Plane, Control Plane, ClickHouse Core, etc.) have monitoring and alerting in place for timely detection and resolution of incidents
- Enhance and refine incident response processes and post-mortem analysis for any outages in ClickHouse Cloud, including working with the support team to communicate with impacted customers
- Continuously improve the reliability and performance of our ClickHouse services
- Plan, enable, and drive chaos engineering initiatives across Engineering teams, based on internal priorities
- Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime
Requirements:
- Bachelor's or Master's degree in Computer Science or a related field
- At least 8 years of experience in Site Reliability Engineering or a related field
- Hands-on experience with Go and/or Python
- Strong knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform
- Excellent understanding of distributed databases and SQL; experience with ClickHouse in particular is a major plus
- Hands-on experience with container orchestration tools such as Kubernetes or Docker Swarm
- Strong experience with automation and configuration management tools such as Ansible, Terraform, or Puppet
- You are a strong problem solver and have solid production debugging skills
- You are passionate about efficiency, availability, scalability, and data governance
- You thrive in a fast-paced environment and see yourself as a partner to the business, with the shared goal of moving it forward
- You have a high level of responsibility, ownership, and accountability
- Excellent communication and interpersonal skills