Thinking Machines Lab is dedicated to advancing collaborative general intelligence and building AI tools for diverse needs. The pre-training researcher role blends research with data engineering: you will curate and analyze pre-training datasets for AI models, work that requires both theoretical exploration and hands-on experimentation.
Responsibilities:
- Design and implement techniques for curating, sourcing, and filtering large-scale text, code, and multimodal data
- Develop data quality metrics and analyses that measure coverage, diversity, and representativeness across sources
- Collaborate with research and infrastructure teams to scale data processing systems efficiently and reproducibly
- Investigate and mitigate data risks, including privacy, safety, and licensing concerns, to ensure responsible and ethical data use
- Continuously evaluate dataset improvements by analyzing their downstream effects on model learning and behavior
- Publish and present research that moves the entire community forward, and share code, datasets, and insights that accelerate progress across industry and academia