Thinking Machines Lab's mission is to empower humanity by advancing collaborative general intelligence. Pre-training data researchers blend research with large-scale data engineering to assemble the pre-training datasets that support the next generation of AI models.

Responsibilities:

Design and implement techniques for curating, sourcing, and filtering large-scale text, code, and multimodal data
Develop data quality metrics and analysis to measure coverage, diversity, and representativeness across sources
Collaborate with research and infrastructure teams to scale data processing systems efficiently and reproducibly
Investigate and mitigate data risks, including privacy, safety, and licensing concerns, to ensure responsible and ethical data use
Continuously evaluate dataset improvements by analyzing their downstream effects on model learning and behavior
Publish and present research that moves the entire community forward, sharing code, datasets, and insights that accelerate progress across industry and academia

Research, Pre-Training Data
