profiling GPU behavior, debugging training pipelines
Improve throughput, choosing the right parallelism strategies
Design the infrastructure for efficient model training at scale
Work across cluster management, model training, efficient data pipelines, inference and optimizing PyTorch code
Requirements
Familiarity with the latest and most effective techniques in optimizing training and inference workloads—not from reading papers, but from implementing them
Deep understanding of GPU memory hierarchy and computation capabilities
Experience optimizing for both memory-bound and compute-bound operations
Expertise with efficient attention algorithms and their performance characteristics at different scales
Nice to Have: Experience in implementing custom GPU kernels and integrating them into PyTorch
Familiarity with high-performance storage solutions and understanding of their performance characteristics for ML workloads