Build and maintain high‑performance streaming and batch data pipelines that power AI applications, ensuring reliable low‑latency ingestion and high‑throughput processing.
Implement and extend embedding generation workflows, vector store integrations, and retrieval pipelines supporting semantic search, RAG systems, and AI assistants.
Develop and optimize scalable storage and retrieval patterns, focusing on cost‑efficient architecture and consistent, predictable performance in production.
Implement AI‑optimized data models and storage patterns that align with broader enterprise architecture and platform requirements.
Integrate pipelines with shared AI platform services (agent frameworks, registries, feature stores), ensuring clean, versioned, and reliable data delivery.
Build reusable ingestion, transformation, and data processing components that streamline adoption across engineering teams.
Embed end‑to‑end observability into data systems, including metrics, structured logging, automated alerts, drift detection, and failure analysis.
Implement robust data quality validation, schema evolution safeguards, and governance/compliance controls.
Ensure deployed pipelines meet high standards for reliability, recoverability, auditability, and long‑term maintainability.
Drive execution by owning the full development lifecycle: prototyping, implementation, testing, deployment, optimization, and documentation.
Collaborate closely with infrastructure, ML engineering, product, and governance teams to deliver production‑ready AI capabilities.
Lead by example through strong execution, high‑quality code, and proactive problem solving.
Influence design direction through technical proposals and hands‑on delivery rather than formal ownership of standards.
Requirements
5+ years of data engineering experience, with at least 1 year in a lead or senior technical role.
Proven experience building and scaling streaming data pipelines (e.g., Kafka, Flink, Spark, Kinesis) in large-scale, distributed environments.
Strong skills in Python, Java, and SQL, with expert-level proficiency in either Python or Java.
Experience with embedding pipelines and vector stores (e.g., Pinecone, Weaviate, FAISS, pgvector).
Strong knowledge of data modeling, storage optimization, and retrieval patterns for large-scale systems.
Hands-on experience with workflow orchestration tools (e.g., Airflow, Dagster).
Strong collaboration and communication skills, able to partner across AI engineering, infra, and product teams.
Familiarity with testing, monitoring, and automation for data pipelines.
Tech Stack
Airflow
Java
Kafka
Python
Spark
SQL
Benefits
A bonus and/or long-term incentive units may be provided as part of the compensation package, in addition to a full range of medical, financial, and/or other benefits.