Thinking Machines Lab is focused on advancing collaborative general intelligence and creating accessible AI tools. They are seeking an Infrastructure Research Engineer to design and build systems for training reinforcement learning models at scale, working closely with researchers and infrastructure teams.
Responsibilities:
- Design, build, and optimize the infrastructure that powers large-scale reinforcement learning and post-training workloads
- Improve the reliability, scalability, and throughput of RL training pipelines and distributed RL workloads
- Develop shared monitoring and observability tools to ensure high uptime, debuggability, and reproducibility for RL systems
- Collaborate with researchers to translate algorithmic ideas into production-grade training pipelines
- Build evaluation and benchmarking infrastructure that measures model progress on helpfulness, safety, and factuality
- Publish and share findings through internal documentation, open-source libraries, or technical reports that advance the field of scalable AI infrastructure