Crossing Hurdles is seeking a PhD Rater for a part-time position focused on designing and evaluating real-world STEM benchmark problems. The role involves implementing tasks in Python, analyzing AI model behavior, and improving the quality of benchmarks used to evaluate AI models.
Responsibilities:
- Design challenging, real-world STEM benchmark problems in domains such as data science, machine learning, finance, and software engineering
- Implement tasks within an agentic development environment using Python
- Create reproducible problem setups with clear specifications and executable tests
- Evaluate and analyze AI model behavior, including reasoning traces and agent workflows
- Diagnose reasoning failures, logic gaps, and problem-solving limitations in AI systems
- Contribute to improving benchmark quality and evaluation frameworks for frontier AI models
Requirements:
- Current PhD candidate or recent PhD graduate
- Deep expertise in data science, machine learning, finance, and/or Python-based software development
- Strong research background in advanced STEM topics
- Ability to reliably commit 30+ hours per week
- Demonstrated technical output, such as high-quality open-source contributions or research work
- Ability to analyze agent behavior traces and diagnose failures beyond surface-level errors