About this role
About the Team
The Seed Infrastructures team oversees the distributed training, reinforcement learning framework, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.
Responsibilities
- Ensure the training platform operates reliably and efficiently across pre-training, fine-tuning, evaluation, and inference workloads for large models
- Build and maintain system observability, fault detection, and troubleshooting tools, enabling AI Ops-driven proactive monitoring of distributed ML workloads
- Maintain the stability, elasticity, and performance of framework and infrastructure components across multi-tenant, multi-cloud, and heterogeneous GPU environments
- Manage cluster governance, optimize resource utilization, and improve operational efficiency and reliability of ML services
- Develop software tools, dashboards, and automation to monitor, manage, and diagnose ML training infrastructure effectively
- Participate in global team rotations for system monitoring, on-call support, and incident response
The base salary range for this position in the selected city is $198,360 - $416,100 annually.