Research Engineer – Multimodal Training Infrastructure (Seed Infra)
San Jose, California, United States of America
Full Time
1 day ago
$208,800 - $438,000 USD
Key skills
AI, Communication
About this role
About the team
The Seed Infrastructures team oversees the distributed training, reinforcement learning framework, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.
Responsibilities
- Conduct research and development on large-scale infrastructure to enable efficient training of foundation models, multimodal LLMs, and image/video generation models
- Design and optimize distributed training strategies for multimodal LLMs, including parallelism schemes, computation and communication optimization, and throughput scaling on large GPU clusters
- Investigate system reliability and resilience techniques, such as fast checkpointing, fault tolerance, and failure diagnosis for long-running training workloads
- Research and optimize network, scheduling, and GPU memory management across the training stack, driving cross-layer performance improvements
- Analyze performance bottlenecks in exascale training systems and propose principled, data-driven optimization methods
- Bridge cutting-edge research and large-scale production deployment by translating research ideas into scalable, real-world infrastructure solutions
The base salary range for this position in the selected city is $208,800 - $438,000 annually.