Yelp values individual authenticity and encourages creative solutions within a collaborative engineering culture. The company is seeking a Site Reliability Engineer specializing in Kafka to manage its real-time data streaming infrastructure and keep critical business functions reliable as the company grows.
Responsibilities:
- Design, deploy, and maintain large-scale Kafka event streaming infrastructure across hybrid and multi-cloud environments
- Collaborate with engineers to enable new features, ensure data pipeline reliability, and advise on best practices for real-time data processing
- Execute and automate Kafka cluster upgrades, migrations, and major version rollouts with minimal impact to critical services
- Build or enhance self-service capabilities and automation for cluster operations, scaling, and incident recovery
- Troubleshoot complex issues affecting data flow, performance, or stability, and drive root cause analyses
- Participate in on-call rotations. Our geographically distributed SRE teams use a “follow-the-sun” model, so no one needs to be on call 24 hours a day!
Requirements:
- Strong hands-on experience designing and implementing large-scale Kafka event streaming capabilities in production on Linux, across hybrid or multi-cloud environments, including upgrades and migrations between platforms or versions
- In-depth knowledge of event streaming/data-in-motion design principles, architecture, and operational nuances
- Programming proficiency in Java, Python, or similar modern languages for tooling, integration, and automation
- Familiarity with Kafka Client APIs (Producer, Consumer, Streams), as well as sizing and capacity planning for high-throughput clusters
- Experience designing and optimizing real-time data streaming solutions with technologies like Apache Flink
- Experience automating infrastructure and operational tasks (configuration management, infrastructure as code (IaC), scripting, or related tooling)
- Problem-solving mindset with an eagerness to learn, take initiative, and advocate for infrastructure best practices in a fast-paced environment