Atoms is building the machines that power the next era of progress. The company is seeking a foundational Machine Learning Infrastructure Engineer to design and build the large-scale ML training infrastructure behind its next-generation autonomous transport models.
Responsibilities:
- Design, implement, and scale repeatable machine learning infrastructure on Kubernetes to support large-scale distributed GPU training of novel neural networks
- Use distributed compute frameworks to efficiently manage and run a high volume of complex, concurrent ML training jobs across large GPU clusters
- Integrate advanced model management and experiment tracking tools to provide researchers with deep observability into training metrics and run performance
- Build and optimize high-throughput data ingestion pipelines to seamlessly stream petabyte-scale multi-sensor vehicle logs into training environments
- Architect robust infrastructure for autonomous model validation and continuous integration testing, ensuring new vehicle policy releases are entirely regression-free
- Partner closely with core robotics engineers and machine learning researchers to eliminate workflow bottlenecks and accelerate the deploy-to-vehicle lifecycle