About this role

Crossing Hurdles is seeking a PhD Rater for a part-time position focused on designing and evaluating real-world STEM benchmark problems. The role involves implementing tasks in Python, analyzing AI model behavior, and improving the quality of the benchmarks used to evaluate AI models.

Responsibilities:

Design challenging, real-world STEM benchmark problems in domains such as data science, machine learning, finance, and software engineering
Implement tasks within an agentic development environment using Python
Create reproducible problem setups with clear specifications and executable tests
Evaluate and analyze AI model behavior, including reasoning traces and agent workflows
Diagnose reasoning failures, logic gaps, and problem-solving limitations in AI systems
Contribute to improving benchmark quality and evaluation frameworks for frontier AI models

Requirements:

Current PhD candidate or recent PhD graduate
Deep expertise in data science, machine learning, finance, and/or Python-based software development
Strong research background in advanced STEM topics
Ability to reliably commit 30+ hours per week
Demonstrated technical output such as high-quality open-source contributions or research work
Ability to analyze agent behavior traces and diagnose failures beyond surface-level errors

Machine Learning Engineer | Remote
