Reflection AI is on a mission to build open superintelligence and make it accessible to all. They are seeking a Member of Technical Staff to build and scale distributed training systems for foundation models, collaborating closely with research teams and optimizing large-scale training workloads.
Responsibilities:
- Build and scale distributed training systems that power frontier model pre-training
- Work closely with research teams to design and operate large-scale training runs for foundation models
- Develop infrastructure that enables efficient training across thousands of GPUs using modern distributed training frameworks
- Optimize throughput, stability, and efficiency for large-scale model training workloads
- Collaborate directly with pre-training researchers to translate experimental ideas into scalable, production-ready training systems
- Improve the performance of distributed training workloads by optimizing communication, memory usage, and GPU utilization
- Build and maintain training pipelines that support large-scale datasets, checkpointing, and experiment iteration
- Debug and resolve performance bottlenecks across the distributed training stack, including model parallelism, GPU communication, and training runtime systems
- Contribute to the development of systems that enable rapid experimentation and iteration on new training techniques