Vmax is an applied research lab developing AI capable of open-ended learning. The lab is seeking strong infrastructure engineers to build systems for reinforcement learning (RL) at scale, with a focus on distributed rollouts, training orchestration, and data pipelines.
Responsibilities:
- Build infrastructure for distributed RL training and inference across thousands of GPUs
- Improve the reliability, debuggability, and throughput of RL experiments
- Build interfaces that allow researchers and applied ML engineers to launch, inspect, compare, and reproduce experiments easily
- Own infrastructure projects end to end, from architecture and implementation through deployment, documentation, and long-term maintenance
- Identify and eliminate bottlenecks in training, rollout generation, eval execution, data movement, and cluster utilization
- Maintain engineering standards for RL infrastructure, including testing, observability, versioning, and reproducibility