Thinking Machines Lab is dedicated to advancing collaborative general intelligence and providing AI tools for unique needs. The company is seeking a Software Engineer to design, build, and operate a GPU supercomputing environment for large-scale training and inference, with a focus on high performance and reliability.
Responsibilities:
- Operate and automate large GPU clusters, including provisioning, imaging, and capacity planning
- Write software that abstracts cluster management and presents a unified interface for training and inference
- Extend scheduling/orchestration (Kubernetes, Slurm, or similar) for topology‑aware placement, preemption, quotas, and fair‑share multi‑tenancy
- Monitor and improve operational metrics for speed, reliability, and error recovery
- Build reliable storage and artifact paths for datasets, checkpoints, and logs with clear retention and lineage
- Partner with researchers to unblock scale runs and advise on parallelism and performance trade‑offs