Architect pipelines across cloud object storage (S3, GCS, Azure Blob), data lakes, and metadata catalogs.
Optimize large-scale processing with distributed frameworks (Spark, Dask, Ray, Flink, or equivalents).
Implement partitioning, sharding, caching strategies, and observability (monitoring, logging, alerting) for reliable pipelines.
Design, implement, and maintain distributed ingestion pipelines for structured and unstructured data (images, 3D/2D assets, binaries).
Build scalable ETL/ELT workflows to transform, validate, and enrich datasets for AI/ML model training and analytics.
Support preprocessing of unstructured assets (e.g., images, 3D/2D models, video) for training pipelines, including format conversion, normalization, augmentation, and metadata extraction.
Implement validation and quality checks to ensure datasets meet ML training requirements.
Collaborate with ML researchers to quickly adapt pipelines to evolving pretraining and evaluation needs.
Use infrastructure-as-code (Terraform, Kubernetes, etc.) to manage scalable and reproducible environments.
Integrate CI/CD best practices for data workflows.
Maintain data lineage, reproducibility, and governance for datasets used in AI/ML pipelines.
Work cross-functionally with ML researchers, graphics/vision engineers, and platform teams.
Embrace versatility: switch between infrastructure-level challenges and asset/data-level problem solving.
Contribute to a culture of fast iteration, pragmatic trade-offs, and collaborative ownership.
Requirements
5+ years of experience in data engineering, distributed systems, or a similar field.
Strong programming skills in Python; Scala, Java, or C++ a plus.
Solid skills in SQL for analytics, transformations, and warehouse/lakehouse integration.
Proficiency with distributed frameworks (Spark, Dask, Ray, Flink).
Familiarity with cloud platforms (AWS/GCP/Azure), object storage (S3), and data formats (Parquet, Delta Lake, etc.).
Experience with workflow orchestration tools (Airflow, Prefect, Dagster).
Tech Stack
Airflow
AWS
Azure
Cloud
Distributed Systems
ETL
Google Cloud Platform
Java
Kubernetes
Python
Ray
Scala
Spark
SQL
Terraform
Benefits
Stock options available for core team members.
Comprehensive health, dental, and vision insurance.