Hydrolix is revolutionizing data management and analytics with its innovative cloud data platform. The company is seeking a Site Reliability Engineer to contribute to the reliability and scalability of the platform and to help deliver solutions tailored to customer needs.
Responsibilities:
- Deploy, maintain, and ensure a highly reliable fleet of Kubernetes clusters and Hydrolix deployments across multiple cloud platforms
- Design, implement, and maintain systems and processes to enhance the reliability, availability, and performance of our services
- Build and optimize CI/CD tools and processes to ensure efficient and reliable deployments
- Develop and manage monitoring, alerting, and incident response strategies to minimize downtime and enable rapid recovery
- Conduct comprehensive root cause analyses of system failures and implement long-term preventive measures
- Automate repetitive tasks and optimize system performance to improve operational efficiency
- Share coverage of weekday business hours and take one weekend shift per month
- Work closely with software engineering, infrastructure, and product teams to integrate reliability practices into every stage of the development lifecycle
- Champion SRE best practices and foster a culture of operational excellence across the organization
- Collaborate with a distributed team of engineers worldwide to provide round-the-clock support
- Interface with customers to address and resolve reported incidents, ensuring a seamless user experience
Requirements:
- Proven experience as a Site Reliability Engineer or in a similar role, with at least five years supporting complex distributed systems
- Experience with monitoring and debugging tools such as Prometheus, Vector, Grafana, Superset, or Kibana
- Proficiency in at least one major cloud platform (AWS, GCP, Azure, or Linode)
- Experience with SQL databases; familiarity with PostgreSQL is a plus
- Proficiency in programming languages such as Python, Go, or Rust
- Strong experience with Linux systems, including performance tuning and system-level troubleshooting
- Excellent written and verbal communication skills, with the ability to convey technical concepts clearly to diverse audiences, including customers and cross-functional teams