Sciforium is an AI infrastructure company developing next-generation multimodal AI models and a proprietary, high-efficiency serving platform. They are seeking a highly technical and visionary LLM Dataset Engineer to lead the strategy, creation, and curation of massive datasets that power their foundation models.

Responsibilities:

Foundation Dataset Strategy: Own the end-to-end creation of pre-training datasets for LLMs. This includes defining the mix of web data, code, books, and technical papers to optimize for downstream model performance
Petabyte-Scale Curation: Design and implement sophisticated pipelines for data cleaning, exact/fuzzy deduplication, and high-quality signal extraction from petabytes of raw, unstructured data
Post-Training & Alignment Data: Lead the development of high-quality post-training datasets, including Supervised Fine-Tuning (SFT) instructions, multi-turn dialogues, and preference modeling data (RLHF/DPO)
Multimodal Expansion: Drive the acquisition and processing of vision and video data, navigating the complexities of multimodal alignment, video compression, and temporal data consistency
High-Performance Engineering: Develop high-throughput data processing scripts using Python, leveraging multiprocessing and multithreading to handle massive-scale ingestion and transformation without bottlenecks
Data Profiling & Analysis: Conduct deep-dive statistical analysis on training corpora to identify biases, gaps in knowledge, and quality regressions, ensuring the "diet" of the model is mathematically balanced
Synthetic Data Generation: (Added Value) Design pipelines to generate high-reasoning synthetic data to augment gaps in natural datasets, utilizing existing models for data labeling and refinement

LLM Dataset Engineer

Key skills

About this role

Responsibilities: