Dyna Robotics is at the forefront of revolutionizing robotic manipulation with cutting-edge foundation models. They are seeking an experienced Machine Learning Infrastructure Engineer to help scale their ML training platform by designing and maintaining large-scale ML infrastructure.
Responsibilities:
- Architect and implement large-scale ML training pipelines that leverage parallel GPU processing on platforms like GCP or AWS
- Enhance our existing infrastructure to fully exploit parallelism, and design it to scale with future growth
- Manage and optimize high-performance computing resources
- Develop robust distributed computing solutions, addressing challenges like race conditions, memory optimization, and resource allocation
- Optimize model training with techniques such as mixed precision, ZeRO, and LoRA
- Design systems for job rescheduling, automated retries, and failure recovery to maximize uptime and training efficiency
- Implement intelligent job queuing mechanisms to optimize training workloads and resource utilization
- Evaluate tradeoffs between local and networked storage solutions, and implement the options that improve data throughput and access
- Develop strategies for caching training data to optimize performance
- Work closely with ML researchers and data scientists to understand training requirements and bottlenecks
- Continuously monitor system performance, identify areas for improvement, and implement best practices to enhance scalability and reliability