Support end‑to‑end data needs for all AI modalities, including classic ML, GenAI/LLMs, and agentic AI systems
Build robust, scalable data pipelines for structured, semi‑structured, and unstructured data, including text, documents, images, audio, video, and logs
Develop feature engineering pipelines for classic ML, including feature extraction, transformation, and feature store management
Build and optimize GenAI and LLM data pipelines, including embedding generation, vectorization, chunking, metadata extraction, and document enrichment for RAG and context retrieval
Develop data ingestion and orchestration workflows that support agentic AI, including memory stores, event-driven pipelines, tool-use data flows, and real-time retrieval services
Design and implement advanced data solutions using AWS (S3, Glue, Lambda, EMR, Kinesis), Databricks (Spark, Delta Lake, Vector Search), and Dataiku to enable intelligent systems at scale
Implement data governance, quality, lineage, monitoring, and observability to support high-performance, trustworthy AI
Partner with data scientists, ML engineers, and AI product teams to deliver datasets for model development, fine‑tuning, evaluation, and production inference
Optimize pipelines for latency, cost, reliability, and throughput so that AI systems, from batch ML to real-time agents, have the data they need
Requirements
Bachelor’s degree in a technical field (CS, Engineering, Math, or related)
Experience supporting AI at scale across classic ML, GenAI/LLM, and agentic AI systems
Experience with vector databases and semantic search (Databricks Vector Search, Pinecone, FAISS, Milvus, OpenSearch)
Familiarity with LLM and GenAI data preparation, including text processing, tokenization, chunking strategies, and prompt/context formatting
Experience with unstructured data technologies (OCR, NLP pipelines, computer vision data processing)
Hands-on experience with Dataiku for automation, workflow orchestration, and AI project management
Knowledge of MLOps tooling: MLflow, Delta Lake, experiment tracking, CI/CD for ML
Understanding of agentic AI system patterns, such as memory architectures, tool APIs, event-driven workflows, and reasoning chain data requirements
Strong analytical mindset, attention to detail, and commitment to high data quality
Ability to thrive in a fast-paced, evolving AI environment and collaborate across cross-functional teams
Tech Stack
AWS
Spark
Benefits
Employer-subsidized Medical, Dental, Vision, and Life Insurance
Short-Term and Long-Term Disability
401(k) match
Flexible Spending Accounts
Health Savings Accounts
Employee Assistance Program (EAP)
Educational Assistance
Parental Leave
Paid Time Off (for vacation, personal business, sick time, and parental leave)