Build & Operate Large-Scale Feature Pipelines: Design and maintain batch/streaming pipelines (Spark, Flink, Databricks, Airflow) producing ML features for ranking models.
Ensure Point-in-Time Correctness: Build feature sets that use only data available as of each prediction's timestamp, preventing label leakage in offline training and keeping online inference consistent with training.
Develop Embedding & Content Pipelines: Build scalable workflows for metadata, imagery, and multimodal representations; partner with Science teams to operationalize new models.
Architect Data Foundations: Design Delta/Parquet data models and medallion layers, optimizing storage layout and partitioning for latency and cost.
Real-Time Engineering: Build Kafka-based systems for real-time features and user-activity aggregations, ensuring robust handling of out-of-order events and exactly-once semantics.
Governance & Leadership: Define data quality rules and schema evolution processes while collaborating across ML pods to translate model needs into infrastructure.
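The point-in-time correctness responsibility above is, at its core, a leakage-free temporal join: every training row may only see feature values that existed at its event timestamp. A minimal sketch using pandas `merge_asof` as a small-scale stand-in for the Spark equivalent (all table and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical event log: timestamps at which predictions were made.
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
})

# Hypothetical feature snapshots: each row is valid from feature_ts onward.
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_ts": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-01"]),
    "clicks_30d": [10, 42, 7],
})

# Point-in-time join: for each event, take the latest feature row whose
# timestamp is <= the event timestamp -- never a future value (no leakage).
events = events.sort_values("event_ts")
features = features.sort_values("feature_ts")
training_set = pd.merge_asof(
    events, features,
    left_on="event_ts", right_on="feature_ts",
    by="user_id", direction="backward",
)
print(training_set[["user_id", "event_ts", "clicks_30d"]])
```

A naive equality join on `user_id` alone would attach the most recent snapshot to every event, silently leaking future behavior into training; `direction="backward"` is what enforces the as-of semantics.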
Requirements
7+ years of experience in large-scale data or software engineering.
Deep experience with Spark (PySpark/Scala), Databricks, Airflow, and Kafka.
Proficiency in building feature pipelines, implementing temporal (point-in-time) joins, and mitigating training-serving skew.
Experience with AWS/Azure/GCP and high-performance query engines such as Snowflake or Redshift.
Strong programming skills in Python and SQL, with a focus on performance optimization.
Experience in personalization domains (search, ranking, or recommender systems).
Experience supporting petabyte-scale data lakehouses or feature stores.
Familiarity with GenAI/RAG systems, multimodal content, or Delta Live Tables.
Knowledge of causal inference, experimentation signals, or ML evaluation workflows.
Experience with Terraform for governed, repeatable deployments.
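The real-time requirements above (and the Kafka responsibility earlier) hinge on handling out-of-order events. A toy watermark-style tumbling-window aggregator in plain Python, with the Kafka/Flink mechanics abstracted away; `WINDOW` and `ALLOWED_LATENESS` are hypothetical parameters, not any library's API:

```python
from collections import defaultdict

WINDOW = 10           # tumbling-window size (hypothetical units, e.g. seconds)
ALLOWED_LATENESS = 5  # how far out-of-order an event may arrive and still count

def window_counts(events):
    """Count (timestamp, key) events per tumbling window, tolerating
    out-of-order arrivals up to ALLOWED_LATENESS behind the max timestamp.
    Returns ({(window_start, key): count}, number_of_dropped_events)."""
    open_windows = defaultdict(int)
    closed = {}
    watermark = float("-inf")
    dropped = 0
    for ts, key in events:
        # Watermark trails the highest timestamp seen by the lateness budget.
        watermark = max(watermark, ts - ALLOWED_LATENESS)
        if ts < watermark:
            dropped += 1  # too late: its window may already have been emitted
            continue
        open_windows[(ts // WINDOW * WINDOW, key)] += 1
        # Emit (close) windows that end at or before the watermark.
        for w in [w for w in open_windows if w[0] + WINDOW <= watermark]:
            closed[w] = open_windows.pop(w)
    closed.update(open_windows)  # end of stream: flush remaining windows
    return closed, dropped

# Events arrive out of order; (2, "a") and (4, "a") fall behind the watermark.
result, late = window_counts(
    [(1, "a"), (3, "a"), (12, "a"), (2, "a"), (20, "a"), (4, "a")]
)
```

Production systems (Flink, Spark Structured Streaming) implement the same idea with per-partition watermarks and state stores; the trade-off shown here, a larger lateness budget versus fresher emitted aggregates, is the one that matters for user-activity features.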