Baseten powers mission-critical inference for dynamic AI companies and is seeking a Senior Software Engineer – Model Training. In this role, you will build and maintain infrastructure for large-scale training of foundation models, optimizing GPU utilization and collaborating with cross-functional teams to meet customer needs.
Responsibilities:
- Design, build, and maintain distributed training infrastructure for large-scale foundation models
- Implement scalable pipelines for fine-tuning and training across heterogeneous GPU/accelerator clusters
- Optimize training performance through techniques like FSDP, DDP, ZeRO, and mixed precision training
- Contribute to frameworks and tooling that make training workflows efficient, reproducible, and developer-friendly
- Collaborate with cross-functional teams (Product, Forward Deployed Engineering, Inference Infrastructure) to ensure training systems meet real-world customer requirements
- Research and apply emerging techniques in training efficiency and model adaptation, and productionize them in the Baseten platform
- Participate in code reviews, system design discussions, and technical deep dives to maintain a high engineering bar
Requirements:
- Bachelor's degree in Computer Science, Engineering, or related field, or equivalent experience
- 5+ years of experience in ML infrastructure, distributed systems, or ML platform engineering, including 2+ years in a tech lead or engineering manager role
- Strong expertise in distributed training frameworks and orchestration (FSDP, DDP, ZeRO, Ray, Kubernetes, Slurm, or similar)
- Hands-on experience building or scaling training infrastructure for LLMs or other foundation models
- Deep understanding of GPU/accelerator utilization, mixed-precision training, and scaling efficiency
- Proven ability to lead and mentor technical teams while delivering complex infrastructure projects
- Excellent communication skills, with the ability to bridge technical depth and business needs
- Experience building APIs, SDKs, or developer tools for ML workflows
- Familiarity with cluster management and scheduling (Kubernetes, Ray, Slurm, etc.)
- Knowledge of parameter-efficient fine-tuning methods (LoRA, QLoRA) and evaluation pipelines
- Contributions to open-source distributed training or ML infra projects
- Experience with cloud environments (AWS, GCP, Azure) and container orchestration