Member of Engineering – Pre-training, Data Research
Europe
Full Time
No Visa Sponsorship
Key skills
Python, Machine Learning, Deep Learning, LLM, Large Language Models, Agentic, Remote Work
About this role
You’ll be working on our data team, focused on the quality of the datasets delivered for training our models.
This is a hands-on role where your #1 mission is to improve the quality of our pretraining datasets by leveraging your experience, intuition, and training experiments.
This includes synthetic data generation and data mix optimization.
You’ll collaborate closely with other teams such as Pretraining, Post-training, Evals, and Product to define high-quality data needs that map to missing model capabilities and downstream use cases.
Staying in sync with the latest research in the fields of dataset design and pretraining is key to success in this role.
You will lead original research initiatives through short, time-bounded experiments while deploying highly technical engineering solutions into production.
Because the volumes of data to process are massive, you'll have a performant distributed data pipeline and a large GPU cluster at your disposal.
Requirements
Strong machine learning and engineering background
Experience with Large Language Models (LLM), including:
Understanding of transformer architectures and how LLMs learn
Data ablations and scaling laws
Mid-training and Post-training techniques
Training reasoning and agentic models
Experience with evals that track model capabilities (general knowledge, reasoning, math, coding, long-context, etc.)
Experience in building trillion-token-scale pretraining datasets, and familiarity with concepts such as data curation, deduplication, data mixing, tokenization, curriculum design, and the impact of data repetition
Excellent programming skills in Python
Strong prompt engineering skills
Experience working with large-scale GPU clusters and distributed data pipelines
Strong obsession with data quality
Research experience:
Authorship of scientific papers on topics such as applied deep learning, LLMs, or source code generation is a nice-to-have
Can freely discuss the latest papers and descend into the fine details
Is reasonably opinionated
Tech Stack
Python
Benefits
Fully remote work & flexible hours
37 days/year of vacation & holidays
Health insurance allowance for you & dependents
Company-provided equipment
Well-being, always-be-learning & home office allowances