Cohere is a company focused on scaling intelligence to serve humanity by training and deploying advanced AI models. The Data Engineer will manage the end-to-end data pipeline for language models, ensuring data quality and optimal performance.
Responsibilities:
- Design and build scalable data pipelines to ingest, parse, filter, and optimize diverse web datasets
- Conduct data ablations to assess data quality and experiment with data mixtures to enhance model performance
- Develop robust data modeling techniques to ensure datasets are structured and formatted for optimal training efficiency
- Research and implement innovative data curation methods, leveraging Cohere’s infrastructure to drive advancements in natural language processing
- Collaborate with cross-functional teams, including researchers and engineers, to ensure data pipelines meet the demands of cutting-edge language models
Requirements:
- Strong software engineering skills, with proficiency in Python and experience building data pipelines
- Familiarity with data processing frameworks such as Apache Spark, Apache Beam, Pandas, or similar tools
- Experience working with large-scale web datasets such as Common Crawl
- A passion for bridging research and engineering to solve complex data-related challenges in AI model training
- Bonus: papers published at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP)