Design, implement, and train large-scale multimodal generative models for audio generation (diffusion and/or autoregressive models).
Explore new modeling ideas for audio generation (music, sound, speech) while taking inspiration from the language and image domains.
Develop and experiment with post-training methods for new capabilities (fine-grained control, inpainting/outpainting, editing, …).
Conduct rigorous ablation studies, derive actionable insights, and communicate results to the team to inform new research directions.
Contribute hands-on to all stages of model development including data curation, experimentation, evaluation, and deployment.
Requirements
Hands-on experience in training large-scale generative models in a fast-paced research environment.
Deep understanding of cutting-edge methods and ML research in at least one of the following domains: image, language, video, or audio (specific audio experience is not required but is a plus).
Strong proficiency with PyTorch, transformer architectures, and the broader modern deep learning ecosystem.
Solid understanding of distributed training techniques such as FSDP, low-precision training, and model parallelism.
Strong track record of work on generative models (publications in top-tier venues, open-source contributions, or applied ML projects).