DRS IT Solutions Inc is seeking a Data Engineer to support its mission in intelligent driving. The role involves designing and maintaining data pipelines and collaborating with research scientists and machine learning engineers to ensure the reliability and performance of data systems.
Responsibilities:
- Design, implement, and maintain robust data pipelines for ingesting, cleaning, and transforming large-scale autonomous vehicle datasets (camera, LiDAR, radar, GPS, simulation logs)
- Develop scalable storage and retrieval systems using AWS services (S3, EC2, SageMaker, Athena, etc.)
- Ensure data quality and consistency through automated validation, deduplication, and schema enforcement
- Collaborate with ML researchers and engineers to provide efficient access to training data, labels, and metadata
- Optimize data preprocessing and batching pipelines to support large-scale training and evaluation workflows
- Build tools to version and audit datasets, track experiments, and keep features reproducible
- Implement and maintain CI/CD workflows for data and pipeline updates, ensuring minimal downtime and reproducible outputs
- Monitor data pipeline performance and proactively address bottlenecks and outages
Requirements:
- B.S. or M.S. in Computer Science, Data Engineering, or a related field
- 3+ years of experience building production-grade data infrastructure or ML data pipelines
- Strong proficiency with Python and SQL, and experience with data workflow orchestration tools (e.g., Airflow, Prefect, Luigi)
- Deep experience with AWS services, especially S3 (data storage), EC2 (compute), and SageMaker (model training)
- Familiarity with distributed computing frameworks like Spark, Dask, or Ray
- Understanding of best practices for dataset documentation, standardization, and reproducibility in research
- Experience with autonomous vehicle datasets or robotics sensor data
- Familiarity with ML training pipelines and model evaluation workflows
- Prior experience collaborating with researchers or applied ML teams in high-throughput environments