Thinking Machines Lab is focused on advancing collaborative general intelligence and creating accessible AI tools. The company is seeking an Infrastructure Research Engineer to design and build systems for scalable reinforcement learning model training, in close collaboration with researchers and infrastructure teams.

Responsibilities:

Design, build, and optimize the infrastructure that powers large-scale reinforcement learning and post-training workloads
Improve the reliability, scalability, and throughput of RL training pipelines and distributed RL workloads
Develop shared monitoring and observability tools to ensure high uptime, debuggability, and reproducibility for RL systems
Collaborate with researchers to translate algorithmic ideas into production-grade training pipelines
Build evaluation and benchmarking infrastructure that measures model progress on helpfulness, safety, and factuality
Publish and share learnings through internal documentation, open-source libraries, or technical reports that advance the field of scalable AI infrastructure

Research Engineer, Infrastructure, RL Systems

About this role

Responsibilities: