Roblox is a platform where users explore, create, and connect through immersive digital experiences. As a Principal Software Engineer on the Compute team, you will lead GPU and AI accelerator capabilities, ensuring GPU hosts are production-ready and reliable for GPU and AI workloads. The role spans technical leadership, GPU host lifecycle management, and GPU strategy across the organization.
Responsibilities:
- Serve as the GPU technical leader for the Compute team, partnering across Kubernetes, Machine Bootstrap, Networking, and Cloud to drive GPU strategy end to end
- Own the GPU host lifecycle above raw fleet management: driver, firmware, and CUDA stack management; GPU health and telemetry; and remediation of GPU-specific failures (Xid errors, ECC errors, thermal events, and NVLink and fabric faults)
- Architect how GPU capacity is exposed to compute platforms, including scheduling, isolation, and integration with Kubernetes for GPU and AI workloads
- Drive GPU reliability and performance at fleet scale, defining how unhealthy accelerators are detected, diagnosed, and automatically repaired before they impact production
- Evaluate and onboard new GPU and AI accelerator platforms, networking topologies (NVLink, InfiniBand, RoCE), and multi-node training and inference patterns
- Establish the standards, tooling, and APIs that let other engineering teams consume GPU compute safely and efficiently, reducing toil and raising the bar for the org