Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence. They are seeking an engineer to design, build, and operate the GPU supercomputing environment that powers large-scale training and inference.

Responsibilities:

Operate and automate large GPU clusters, including provisioning, imaging, and capacity planning
Write software that abstracts cluster management and presents a unified interface for training and inference
Extend scheduling/orchestration (Kubernetes, Slurm, or similar) for topology‑aware placement, preemption, quotas, and fair‑share multi‑tenancy
Monitor and improve operational metrics for speed, reliability, and error recovery
Build reliable storage and artifact paths for datasets, checkpoints, and logs with clear retention and lineage
Partner with researchers to unblock scale runs and advise on parallelism and performance trade‑offs

Software Engineer, Supercomputing
