Yelp values individual authenticity and encourages creative solutions within a collaborative engineering culture. The company is seeking a Site Reliability Engineer specializing in Kafka to manage its real-time data streaming infrastructure and keep critical business functions reliable as the company grows.

Responsibilities:

Design, deploy, and maintain large-scale Kafka event streaming infrastructure across hybrid and multi-cloud environments
Collaborate with engineers to enable new features, ensure data pipeline reliability, and advise on best practices for real-time data processing
Execute and automate Kafka cluster upgrades, migrations, and major version rollouts with minimal impact to critical services
Build or enhance self-service capabilities and automation for cluster operations, scaling, and incident recovery
Troubleshoot complex issues affecting data flow, performance, or stability, and drive root cause analyses
Participate in on-call rotations. Our geographically distributed SRE teams use a “follow-the-sun” model, so no one needs to be on-call 24 hours a day!

Requirements:

Strong hands-on experience designing and implementing large-scale Kafka event streaming capabilities in production on Linux, across hybrid or multi-cloud environments, including upgrades and migrations between platforms or versions
In-depth knowledge of event streaming/data-in-motion design principles, architecture, and operational nuances
Programming proficiency in Java, Python, or similar modern languages for tooling, integration, and automation
Familiarity with Kafka Client APIs (Producer, Consumer, Streams), as well as sizing and capacity planning for high-throughput clusters
Experience designing and optimizing real-time data streaming solutions with technologies like Apache Flink
Knowledge of automating infrastructure and operational tasks (configuration management, IaC, scripting, or related)
Problem-solving mindset with an eagerness to learn, take initiative, and advocate for infrastructure best practices in a fast-paced environment

Site Reliability Engineer, Core Streaming (Remote - Canada)
