Cohere is on a mission to scale intelligence to serve humanity by training and deploying frontier models for AI systems. They are seeking a Member of Technical Staff, Training Infra Engineer, to contribute to model training pipelines and bridge the gap between research and production by building high-performance, scalable training software.
Responsibilities:
- Design and write high-performance, scalable software for training
- Improve our training setup from an infrastructure and codebase performance standpoint
- Craft and implement tools to speed up our training cycles and improve the overall efficacy of our training infrastructure
- Research, implement, and experiment with ideas on our supercompute and data infrastructure
- Learn from and work with the best researchers in the field
Requirements:
- Extremely strong software engineering skills
- Proficiency in Python and related ML frameworks such as JAX, PyTorch, and XLA/MLIR
- Experience with distributed training infrastructure (Kubernetes, Slurm) and associated frameworks (Ray)
- Experience using large-scale distributed training strategies
- Hands-on experience training large models at scale and contributing to the tooling and/or setup of the training infrastructure
- Bonus: papers published at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP)