Thinking Machines Lab is on a mission to empower humanity by advancing collaborative general intelligence. The company is seeking an infrastructure research engineer to design and build core systems for scalable, efficient training of large models, so that research teams can focus on science rather than system bottlenecks.
Responsibilities:
- Design, implement, and optimize distributed training systems that scale across thousands of GPUs and nodes
- Develop performance optimizations that maximize training throughput and efficiency
- Develop reusable frameworks and libraries to improve training reproducibility, reliability, and scalability for new model architectures
- Establish standards for reliability, maintainability, and security, ensuring systems are robust under rapid iteration
- Collaborate with researchers and engineers to build scalable infrastructure
- Publish and share learnings through internal documentation, open-source libraries, or technical reports that advance the field of scalable AI infrastructure