Techire AI is looking for a Staff ML Infrastructure Engineer to scale GPU infrastructure and improve cluster reliability. The role centers on distributed training and GPU infrastructure, ensuring large-scale training is usable for researchers, and involves close collaboration with a high-performing team.
Responsibilities:
- Help push an already high-performing team past its current operating level, using your skills and experience to scale training workloads, improve cluster reliability and utilization, and build systems that hold up under real pressure
- Focus on distributed training and GPU infrastructure, making large-scale training actually usable for researchers—not just possible
- Work across frontier model training, scientific workloads, and robotics environments
- Deal with high-throughput systems and real-world constraints, not just controlled experiments
- Join a team that owns compute end-to-end—infra, systems, and operations—working closely with researchers to make training at this scale reliable
Requirements:
- Experience scaling GPU infrastructure from 2,000 to 10,000+ GPUs
- Experience with Ray, Slurm, or similar cluster scheduling and orchestration tools
- Experience supporting core model training