Cypress HCM is focused on engineering innovative model evaluations and porting external benchmarks to its internal infrastructure. The Evaluation Engineer will be responsible for implementing novel evaluations, performing quality control, and keeping the team up to date on new benchmarks.
Responsibilities:
- Porting new external benchmarks to the team's internal infrastructure so they can be run as part of the evaluation stack for new model releases
- Keeping up to date with new evals and benchmarks, and pitching the team on porting newly released evals
- Performing rigorous quality control for new and existing evals
- Implementing novel evaluations to measure dangerous capabilities and safety of frontier models
Requirements:
- Strong Python coding experience and the ability to write clean code quickly
- Experience working in a small team on a large, shared codebase
- Experience designing and building model evaluations
- Detail-oriented, with tenacity to dig through transcripts to identify and resolve issues
- Ability to quickly and independently learn new skills and frameworks
- Team player with strong communication skills
- Demonstrated research experience in the evals space
- Experience with agentic evaluations and with Docker