Thinking Machines Lab is dedicated to advancing collaborative general intelligence and building AI tools for diverse needs. The pre-training researcher role blends research with data engineering to curate and analyze pre-training datasets for AI models, and it calls for both theoretical exploration and hands-on experimentation.

Responsibilities:

Design and implement techniques for curating, sourcing, and filtering large-scale text, code, and multimodal data
Develop data quality metrics and analysis to measure coverage, diversity, and representativeness across sources
Collaborate with research and infrastructure teams to scale data processing systems efficiently and reproducibly
Investigate and mitigate data risks, including privacy, safety, and licensing concerns, to ensure responsible and ethical data use
Continuously evaluate dataset improvements by analyzing their downstream effects on model learning and behavior
Publish and present research that moves the entire community forward. Share code, datasets, and insights that accelerate progress across industry and academia

Research, Pre-Training Data
