Vmax is an applied research lab developing AI capable of open-ended learning. The lab is seeking strong infrastructure engineers to build systems for reinforcement learning (RL) at scale, focusing on distributed rollouts, training orchestration, and data pipelines.

Responsibilities:

Build infrastructure for distributed RL training and inference across thousands of GPUs
Improve the reliability, debuggability, and throughput of RL experiments
Build interfaces that allow researchers and applied ML engineers to launch, inspect, compare, and reproduce experiments easily
Own infrastructure projects end to end, from architecture and implementation through deployment, documentation, and long-term maintenance
Identify and eliminate bottlenecks in training, rollout generation, eval execution, data movement, and cluster utilization
Maintain engineering standards for RL infrastructure, including testing, observability, versioning, and reproducibility

Member of Technical Staff - RL Infrastructure
