Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence. The company is seeking an engineer to design, build, and operate the GPU supercomputing environment that powers large-scale training and inference.
Responsibilities:
- Operate and automate large GPU clusters, including provisioning, imaging, and capacity planning
- Write software that abstracts cluster management and presents a unified interface for training and inference
- Extend scheduling/orchestration (Kubernetes, Slurm, or similar) for topology‑aware placement, preemption, quotas, and fair‑share multi‑tenancy
- Monitor and improve operational metrics for speed, reliability, and error recovery
- Build reliable storage and artifact paths for datasets, checkpoints, and logs with clear retention and lineage
- Partner with researchers to unblock scale runs and advise on parallelism and performance trade‑offs