About this role

Zoox is developing the first ground-up, fully autonomous vehicle fleet and the supporting ecosystem required to bring this technology to market. The company is seeking a Senior ML Storage Infrastructure Engineer to work on custom High-Performance Computing (HPC) infrastructure that supports machine learning workflows across its software divisions. The role involves designing and optimizing storage infrastructure, driving GPU efficiency, and building essential tools for software teams.

Responsibilities:

Design, build, and optimize a petabyte-scale, in-house HPC storage infrastructure, ensuring high performance and reliability for our machine learning workloads across both cloud and on-premises data centers
Drive GPU efficiency by strategically co-locating storage and compute, architecting a storage layer that keeps tens of thousands of GPUs fully utilized and prevents I/O bottlenecks
Drive key initiatives in training and storage optimization by partnering with ML practitioners, applying your deep understanding of frameworks such as PyTorch and TensorFlow to meet their evolving demands
Investigate and adopt new distributed system paradigms and cutting-edge technologies to ensure our infrastructure can scale to meet ever-growing computational and storage demands
Create production-grade web service APIs, SDKs, and other essential tools to deliver a world-class developer experience for all software teams at Zoox
