Nuance Labs is a Series A company building advanced AI avatars with emotional intelligence. The company is seeking an experienced Member of Technical Staff to develop and manage distributed training infrastructure for large-scale omni model pretraining, solving hard problems in systems design and GPU execution.
Responsibilities:
- Own the distributed training stack for omni model pretraining, from 0→1 system design to 1→10 scaling across large GPU clusters
- Build and operate the core training runtime: job orchestration, distributed execution, checkpointing, recovery, monitoring, and debugging for long-running training jobs
- Optimize large-scale training performance by tuning parallelism strategy, GPU communication, memory usage, and data throughput to improve model FLOPs utilization (MFU), step time, and end-to-end training efficiency
- Build infrastructure for omni training workloads: high-throughput audio/video/text data loading, temporal alignment, variable sequence handling, multimodal synchronization, and memory-efficient training
- Evolve the platform as model architectures, training recipes, data mixtures, sequence lengths, hardware constraints, and research directions change