About this role

Focus on the full training stack: profiling GPU behavior and debugging training pipelines
Improve throughput by choosing the right parallelism strategies
Design the infrastructure for efficient model training at scale
Work across cluster management, model training, efficient data pipelines, inference, and PyTorch code optimization

Familiarity with the latest and most effective techniques for optimizing training and inference workloads, learned by implementing them rather than by reading papers
Deep understanding of GPU memory hierarchy and computation capabilities
Experience optimizing for both memory-bound and compute-bound operations
Expertise with efficient attention algorithms and their performance characteristics at different scales

Nice to have

Experience implementing custom GPU kernels and integrating them into PyTorch
Familiarity with high-performance storage solutions and understanding of their performance characteristics for ML workloads
Experience managing SLURM clusters at scale

Training Infrastructure Engineer

Key skills