Yahoo serves as a trusted guide for hundreds of millions of people globally, helping them achieve their goals online through our portfolio of iconic products. As a Senior Data Engineer in the Consumer Data Organization, you will design and implement streaming data pipelines that process billions of user signals daily, ensuring data freshness for downstream activation and monetization use cases worth hundreds of millions in annual revenue.

Responsibilities:

Develop and optimize real-time streaming pipelines for third-party ID mutations, behavioral signals, and user event ingestion
Build scalable Kafka-based data pipelines handling millions of events per second with exactly-once processing semantics
Implement Apache Beam jobs on Google Cloud Dataflow for stream processing, enrichment, validation, and transformation of user signals
Design comprehensive monitoring and data quality checks ensuring pipeline reliability, data freshness, and SLA compliance
Collaborate with the Storage team on efficient Cloud Spanner write patterns, schema design, and high-throughput mutation strategies
Optimize pipeline performance to reduce lag, improve throughput, and minimize processing costs in GCP infrastructure
Implement dead letter queues, retry logic, and error handling strategies to prevent data loss
Troubleshoot production data issues including pipeline failures, data quality problems, and performance degradation
Work with the Privacy team to ensure compliant data handling, PII protection, and sensitive data detection in real-time streams
Create comprehensive documentation for pipeline architecture, operational runbooks, and on-call procedures
Participate in the on-call rotation supporting production streaming pipelines with a 99.9% uptime SLA
Partner with upstream data producers to ensure consistent event schemas and data quality

Requirements:

Bachelor's degree in Computer Science, Data Engineering, Software Engineering, or a related technical field
5+ years of data engineering experience building production data systems
3+ years of hands-on experience with real-time/streaming data processing systems at scale
2+ years of experience with GCP (Dataflow, Pub/Sub, BigQuery, Spanner, GCS) or AWS equivalents (Kinesis, EMR, DynamoDB)
Strong proficiency in Python, Java, or Scala for data pipeline development
Hands-on experience with Apache Kafka, Google Cloud Pub/Sub, or other distributed messaging platforms
Experience with Apache Beam, Google Cloud Dataflow, or Apache Spark Streaming for stream processing
Understanding of stream processing patterns: windowing, watermarks, exactly-once semantics, state management
SQL proficiency and experience with distributed databases (Spanner, Cassandra, DynamoDB)
Familiarity with data serialization formats: Avro, Protobuf, JSON, Parquet
Strong problem-solving skills and an operational excellence mindset in production environments
Demonstrated ability to deliver reliable data pipelines on schedule with minimal guidance
Excellent collaboration across engineering, product, and infrastructure teams
Team-level impact, with the ability to influence technical decisions within the immediate team
Understanding of data governance and privacy compliance (GDPR, CCPA) in data pipelines
Experience with Cloud Spanner writes at high throughput (millions of writes per second)
Knowledge of data governance frameworks, privacy compliance, and PII handling best practices
Prior experience in adtech, identity platforms, or consumer data systems processing user behavioral data
Familiarity with data quality frameworks: Great Expectations, Deequ, or custom validation systems
Understanding of event-driven architectures, change data capture (CDC), and event sourcing patterns
Experience with schema evolution and schema registries (Confluent Schema Registry, Apicurio)
Contributions to open-source streaming projects (Kafka, Beam, Flink) or data engineering communities
Self-driven and detail-oriented, with excellent multitasking abilities in fast-paced environments

Senior Software Engineer - Real-Time Ingestion