Thinking Machines Lab is focused on advancing collaborative general intelligence and is seeking a Reliability Engineer to maintain its GPU supercomputing fleet. The role involves diagnosing hardware issues, automating fleet monitoring, and working with vendors to keep the fleet reliable for AI research.

Responsibilities:

Investigate, reproduce, and remediate issues across large GPU clusters
Own the drivers, kernel surface, and diagnostics that span hardware, firmware, and OS
Automate the monitoring of fleet reliability and analyze error rates to validate whether a fix or firmware change measurably reduced failures rather than shifting them around
Drive the firmware lifecycle: tracking, qualification, staged rollout, and regression analysis
Engage vendors directly (GPU vendors, server OEMs, NIC vendors, and storage vendors) to get real fixes rather than ticket numbers, and manage RMA flows when hardware needs to come out
Monitor and refine GPU hardware health signals and turn them into actionable reliability improvements
Write clear postmortems and vendor cases that move issues forward

Reliability Engineer, Supercomputing
