Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence. Pre-training researchers blend research with large-scale data engineering to assemble the pre-training datasets that support the next generation of AI models.
Responsibilities:
- Design and implement techniques for curating, sourcing, and filtering large-scale text, code, and multimodal data
- Develop data quality metrics and analyses to measure coverage, diversity, and representativeness across sources
- Collaborate with research and infrastructure teams to scale data processing systems efficiently and reproducibly
- Investigate and mitigate data risks, including privacy, safety, and licensing concerns, to ensure responsible and ethical data use
- Continuously evaluate dataset improvements by analyzing their downstream effects on model learning and behavior
- Publish and present research that moves the entire community forward, sharing code, datasets, and insights that accelerate progress across industry and academia