Atoms is building the machines that power the next era of progress. The company is seeking a foundational Machine Learning Infrastructure Engineer to design and build the large-scale ML training infrastructure behind its next-generation autonomous transport models.

Responsibilities:

Design, implement, and scale repeatable machine learning infrastructure utilizing Kubernetes to support large-scale distributed GPU training of novel neural networks
Leverage distributed compute frameworks to efficiently manage and execute a high volume of complex ML training jobs concurrently across large GPU clusters
Integrate advanced model management and experiment tracking tools to provide researchers with deep observability into training metrics and run performance
Build and optimize high-throughput data ingestion pipelines to seamlessly stream petabyte-scale multi-sensor vehicle logs into training environments
Architect robust infrastructure for autonomous model validation and continuous integration testing, ensuring new vehicle policy releases are entirely regression-free
Partner closely with core robotics engineers and machine learning researchers to eliminate workflow bottlenecks and accelerate the deploy-to-vehicle lifecycle

Staff Machine Learning Infrastructure Engineer
