Zoox is developing the first ground-up, fully autonomous vehicle fleet and the supporting ecosystem required to bring this technology to market. The company is seeking a Senior ML Storage Infrastructure Engineer to work on custom high-performance computing (HPC) infrastructure that supports machine learning workflows across various software divisions. The role involves designing and optimizing storage infrastructure, driving GPU efficiency, and creating essential tools for software teams.
Responsibilities:
- Design, build, and optimize a petabyte-scale, in-house HPC storage infrastructure, ensuring high performance and reliability for our machine learning workloads across both cloud and on-premises data centers
- Drive GPU efficiency by strategically co-locating storage and compute, architecting a storage layer that keeps tens of thousands of GPUs fully utilized and prevents bottlenecks
- Drive key initiatives in training and storage optimization by partnering with ML practitioners, applying your deep understanding of frameworks such as PyTorch and TensorFlow to meet their evolving demands
- Investigate and adopt new distributed system paradigms and cutting-edge technologies to ensure our infrastructure can scale to meet ever-growing computational and storage demands
- Create production-grade web service APIs, SDKs, and other essential tools to deliver a world-class developer experience for all software teams at Zoox