Develop high-scale, reliable data extraction pipelines to extract millions of raw data records from the data collection fleet and convert them into high-value scene data.
Develop data labeling pipelines that run auto-labeling inference for autonomous driving algorithms.
Develop an advanced autonomous driving data SDK covering scene data search, dataset preparation, dataset loading, and related workflows.
Build out the data lakehouse for autonomous driving scene datasets, including sensor data, calibration data, and annotation data.
Identify and resolve performance bottlenecks across the data processing pipelines, from data processing latency and data search latency to Test Procedure (TP) coverage.
Bootstrap and maintain infrastructure for data platform components: data processing pipelines, databases, the data lakehouse, and data serving.
Collaborate with cross-functional teams, including ML algorithm, ML application, and Cloud Infra teams, to align data pipelines with the overall autonomous driving system architecture.
Requirements
Bachelor's degree or higher in Computer Science, Engineering, Robotics, or a similar technical field.
Minimum of 7 years of experience in Data Engineering, DataOps, or ML Platform roles.
Proficiency in Python, with solid experience in Python SDK development.
Solid hands-on experience orchestrating data pipeline jobs with Databricks Workflows or Apache Airflow, and integrating data pipelines with machine learning models.
Solid working experience with databases (e.g., MongoDB, PostgreSQL).
Extensive experience with data technologies and architectures such as Data Warehouses (e.g., Hive) or Lakehouses (e.g., Delta Lake).
Experience with Apache Spark or other big data computing engines.
Excellent leadership and communication skills, with a demonstrated ability to lead technical projects.