Thinking Machines Lab is dedicated to advancing collaborative general intelligence and providing AI tools for unique needs. They are seeking a Software Engineer to design, build, and operate a GPU supercomputing environment for large-scale training and inference, ensuring high performance and reliability.

Responsibilities:

Operate and automate large GPU clusters, including provisioning, imaging, and capacity planning
Write software that abstracts cluster management and presents a unified interface for training and inference
Extend scheduling/orchestration (Kubernetes, Slurm, or similar) for topology‑aware placement, preemption, quotas, and fair‑share multi‑tenancy
Monitor and improve operational metrics for speed, reliability, and error recovery
Build reliable storage and artifact paths for datasets, checkpoints, and logs with clear retention and lineage
Partner with researchers to unblock scale runs and advise on parallelism and performance trade‑offs

Software Engineer, Supercomputing
