Braintrust is an AI observability platform that connects evaluation and observability in a single workflow, providing insight into AI performance in production. They are seeking an Eval Engineer to design and run evaluations of new AI capabilities, turning emerging ideas into measurable experiments and publishing the results for the developer ecosystem.
Responsibilities:
- Design and run evaluations of new AI capabilities
- Compare frontier models, agent systems, and tool workflows
- Turn emerging ideas into measurable benchmarks
- Define datasets, tasks, and scoring logic for experiments
- Design realistic workloads that reflect production environments
- Create tests that expose failure modes and edge cases
- Build evaluation harnesses using Braintrust
- Run comparisons across models, prompts, and agent approaches
- Analyze traces, outputs, and failure patterns
- Invent novel ways to stress test AI systems
- Design scenarios that break agents, prompts, and model reasoning
- Build adversarial or complex datasets that reveal weaknesses
- Write technical posts explaining evaluation methodology and results
- Share datasets and scoring logic so experiments are reproducible
- Help establish better evaluation patterns for the industry through courses
- Develop reusable eval patterns for agents, RAG systems, and LLM apps
- Create open source reference implementations developers can adopt
- Contribute examples and guides that help teams build better evals
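The core loop behind several of the responsibilities above — defining datasets, tasks, and scoring logic, then running comparisons — can be sketched as a minimal, self-contained eval harness. Everything here (the `EvalCase` dataset shape, the `exact_match` scorer, the toy task) is a hypothetical illustration, not Braintrust's actual SDK:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalCase:
    """One row of an eval dataset: an input and its expected output."""
    input: str
    expected: str

def exact_match(output: str, expected: str) -> float:
    """Score 1.0 on an exact string match, 0.0 otherwise."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(
    dataset: List[EvalCase],
    task: Callable[[str], str],
    scorer: Callable[[str, str], float],
) -> float:
    """Run the task over every case and return the mean score."""
    scores = [scorer(task(case.input), case.expected) for case in dataset]
    return sum(scores) / len(scores)

# Tiny hypothetical dataset; the task is a stand-in for a model call.
dataset = [EvalCase("2 + 2", "4"), EvalCase("3 * 3", "9")]
task = lambda expr: str(eval(expr))
print(run_eval(dataset, task, exact_match))  # → 1.0
```

Swapping in different tasks (models, prompts, agent approaches) or scorers against a fixed dataset is what makes the resulting comparisons meaningful and reproducible.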
Requirements:
- Built or contributed to evaluation systems for LLM or agent applications
- Designed experiments comparing models, prompts, or AI architectures
- Written Python code to run tests across models or APIs
- Built datasets or scoring logic for AI quality measurement
- Investigated model failures or unexpected behaviors
- Published technical blog posts, research notes, or engineering write-ups
- Built prototypes quickly to test ideas
- You're an engineer who likes testing systems more than building features
- You enjoy breaking things and understanding why they fail
- You can design experiments that isolate meaningful differences between approaches
- You understand how LLMs, agents, and RAG systems actually work
- You write clearly for technical audiences
- You ship experiments quickly and iterate often
- You care about methodology and reproducibility
- You're curious, creative, and opinionated about how AI should be evaluated