Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence. The company is seeking an engineer to work on data infrastructure, architecting and scaling the core infrastructure for distributed training pipelines and intelligent processing systems over large datasets.
Responsibilities:
- Design, build, and operate scalable, fault-tolerant infrastructure for LLM research: distributed compute, data orchestration, and storage across modalities
- Develop high-throughput systems for data ingestion, processing, and transformation, including training data catalogs, deduplication, quality checks, and search
- Build systems for traceability, reproducibility, and robust quality control at every stage of the data lifecycle
- Implement and maintain monitoring and alerting to support platform reliability and performance
- Collaborate with research teams to unlock new features, improve data quality, and accelerate training cycles