Cypress HCM is seeking an Evaluation Engineer to work on porting external benchmarks and developing novel model evaluations. The role involves ensuring high-quality evaluations for new model releases and requires strong engineering skills and attention to detail.
Responsibilities:
- Porting new external benchmarks to the team's internal infrastructure so they can be run as part of the evaluation stack for new model releases
- Keeping up to date with new evals and benchmarks, and pitching the team on porting newly released evals
- Performing rigorous quality control for new and existing evals
- Implementing novel evaluations to measure dangerous capabilities and safety of frontier models
Requirements:
- Strong Python skills and the ability to write clean code quickly
- Experience working in a small team on a large, shared codebase
- Experience designing and building model evaluations
- Detail-oriented, with the tenacity to dig through transcripts to identify and resolve issues
- Ability to quickly and independently learn new skills and frameworks
- Team player with strong communication skills
- Demonstrated research experience in the evals space
- Experience with agentic evaluations and working with Docker