Cypress HCM is seeking an Evaluation Engineer to work on porting external benchmarks and developing novel model evaluations. The role involves ensuring high-quality evaluations for new model releases and requires strong engineering skills and attention to detail.
Responsibilities:
- Porting new external benchmarks to the team's internal infrastructure so they can be run as part of the evaluation stack for new model releases
- Keeping up to date with new evals and benchmarks, and pitching the team on porting newly released evals
- Performing rigorous quality control for new and existing evals
- Implementing novel evaluations to measure dangerous capabilities and safety of frontier models
Requirements:
- Strong Python skills and the ability to write clean code quickly
- Experience working in a small team on a large, shared codebase
- Experience designing and building model evaluations
- Detail-oriented, with the tenacity to dig through transcripts to identify and resolve issues
- Ability to quickly and independently learn new skills and frameworks
- Team player with strong communication skills
- Demonstrated research experience in the evals space
- Experience with agentic evaluations and working with Docker