Dyna Robotics develops foundation models for robotic manipulation. The company is seeking an experienced Machine Learning Infrastructure Engineer to help scale its ML training platform by designing and maintaining large-scale ML infrastructure.

Responsibilities:

Architect and implement large-scale ML training pipelines that leverage parallel GPU processing on platforms like GCP or AWS
Enhance our existing infrastructure to fully exploit parallelism, and design it so the system can scale with future growth
Manage and optimize high-performance computing resources
Develop robust distributed computing solutions, addressing challenges like race conditions, memory optimization, and resource allocation
Optimize model training with techniques such as mixed precision, ZeRO, and LoRA
Design systems for job rescheduling, automated retries, and failure recovery to maximize uptime and training efficiency
Implement intelligent job queuing mechanisms to optimize training workloads and resource utilization
Evaluate the tradeoffs between local and networked storage solutions, and implement the options that improve data throughput and access
Develop strategies for caching training data to optimize performance
Work closely with ML researchers and data scientists to understand training requirements and bottlenecks
Continuously monitor system performance, identify areas for improvement, and implement best practices to enhance scalability and reliability

Staff Machine Learning Infrastructure Engineer
