Thinking Machines Lab is focused on advancing collaborative general intelligence and is seeking a Reliability Engineer to maintain its GPU supercomputing fleet. The role involves diagnosing hardware issues, automating fleet monitoring, and working with vendors to keep the infrastructure reliable for AI research.
Responsibilities:
- Investigate, reproduce, and remediate issues across large GPU clusters
- Own the drivers, kernel surface, and diagnostics that span hardware, firmware, and OS
- Automate the monitoring of fleet reliability and analyze error rates to validate whether a fix or firmware change measurably reduced failures rather than shifting them around
- Drive the firmware lifecycle: tracking, qualification, staged rollout, and regression analysis
- Engage vendors directly (GPU vendors, server OEMs, NIC vendors, and storage vendors) to get real fixes rather than ticket numbers, and manage RMA flows when hardware needs to come out
- Monitor and improve GPU hardware health signals and turn them into actionable reliability improvements
- Write clear postmortems and vendor cases that move issues forward
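
As a rough illustration of the error-rate analysis mentioned above (checking whether a fix reduced failures rather than shifting them around), the sketch below compares per-class failure rates before and after a change. It is a minimal sketch under stated assumptions, not a description of the team's tooling: the function names, the error-class labels, and the normalization per 1,000 GPU-hours are all hypothetical choices made for the example.

```python
from collections import Counter


def rates_per_1k_gpu_hours(events, gpu_hours):
    """Return failures per 1,000 GPU-hours for each error class.

    `events` is a list of error-class labels, one per observed failure.
    """
    counts = Counter(events)
    return {cls: 1000.0 * n / gpu_hours for cls, n in counts.items()}


def compare_windows(before_events, before_hours, after_events, after_hours):
    """Compare failure rates across two observation windows.

    Reports the change in the total rate and lists every class whose
    rate went up, which is the signature of failures being shifted
    from one class to another instead of being removed.
    """
    before = rates_per_1k_gpu_hours(before_events, before_hours)
    after = rates_per_1k_gpu_hours(after_events, after_hours)
    classes = sorted(set(before) | set(after))
    deltas = {c: after.get(c, 0.0) - before.get(c, 0.0) for c in classes}
    total_delta = sum(after.values()) - sum(before.values())
    shifted = [c for c in classes if deltas[c] > 0]
    return {"total_delta": total_delta, "deltas": deltas, "shifted": shifted}


# Hypothetical data: labels and counts are made up for illustration.
before = ["gpu_fell_off_bus"] * 8 + ["ecc_uncorrectable"] * 2
after = ["gpu_fell_off_bus"] * 1 + ["ecc_uncorrectable"] * 6
result = compare_windows(before, 10_000, after, 10_000)
print(result["total_delta"], result["shifted"])
```

In this made-up data the total rate drops, but one class rises, so the change would be flagged for a closer look. The sketch does no significance testing; a real analysis would also need to ask whether the observed difference is larger than normal window-to-window noise.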