Techire AI is looking for a Staff ML Infrastructure Engineer to scale GPU infrastructure and improve cluster reliability. The role centers on distributed training and GPU infrastructure, ensuring large-scale training is usable for researchers, and involves close collaboration with a high-performing team.
Responsibilities:
- Help push an already high-performing team past its current operating level, using your skills and experience to scale training workloads, improve cluster reliability and utilization, and build systems that hold up under real pressure
- Focus on distributed training and GPU infrastructure, making large-scale training actually usable for researchers—not just possible
- Work across frontier model training, scientific workloads, and robotics environments
- Deal with high-throughput systems and real-world constraints, not just controlled experiments
- Join a team that owns compute end-to-end—infra, systems, and operations—working closely with researchers to make training at this scale reliable
Requirements:
- Experience scaling GPU infrastructure from 2,000 to 10,000+ GPUs
- Experience with Ray, Slurm, or similar cluster scheduling and orchestration tools
- Experience supporting core model training