Braintrust is an AI observability platform that connects evaluation and observability in a single workflow, providing insight into AI performance in production. They are seeking an Eval Engineer to design and run evaluations of new AI capabilities, turning emerging ideas into measurable experiments and publishing the results for the developer ecosystem.
Responsibilities:
- Design and run evaluations of new AI capabilities
- Compare frontier models, agent systems, and tool workflows
- Turn emerging ideas into measurable benchmarks
- Define datasets, tasks, and scoring logic for experiments
- Design realistic workloads that reflect production environments
- Create tests that expose failure modes and edge cases
- Build evaluation harnesses using Braintrust
- Run comparisons across models, prompts, and agent approaches
- Analyze traces, outputs, and failure patterns
- Invent novel ways to stress test AI systems
- Design scenarios that break agents, prompts, and model reasoning
- Build adversarial or complex datasets that reveal weaknesses
- Write technical posts explaining evaluation methodology and results
- Share datasets and scoring logic so experiments are reproducible
- Help establish better evaluation patterns for the industry through courses
- Develop reusable eval patterns for agents, RAG systems, and LLM apps
- Create open source reference implementations developers can adopt
- Contribute examples and guides that help teams build better evals
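The core loop behind several of the responsibilities above — defining datasets, tasks, and scoring logic, then running comparisons — can be sketched as a minimal, self-contained eval harness. Everything here (the `EvalCase` dataset shape, the `exact_match` scorer, the toy task) is a hypothetical illustration, not Braintrust's actual SDK:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalCase:
    """One row of an eval dataset: an input and its expected output."""
    input: str
    expected: str

def exact_match(output: str, expected: str) -> float:
    """Score 1.0 on an exact string match, 0.0 otherwise."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(
    dataset: List[EvalCase],
    task: Callable[[str], str],
    scorer: Callable[[str, str], float],
) -> float:
    """Run the task over every case and return the mean score."""
    scores = [scorer(task(case.input), case.expected) for case in dataset]
    return sum(scores) / len(scores)

# Tiny hypothetical dataset; the task is a stand-in for a model call.
dataset = [EvalCase("2 + 2", "4"), EvalCase("3 * 3", "9")]
task = lambda expr: str(eval(expr))
print(run_eval(dataset, task, exact_match))  # → 1.0
```

Swapping in different tasks (models, prompts, agent approaches) or scorers against a fixed dataset is what makes the resulting comparisons meaningful and reproducible.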
Requirements:
- Built or contributed to evaluation systems for LLM or agent applications
- Designed experiments comparing models, prompts, or AI architectures
- Written Python code to run tests across models or APIs
- Built datasets or scoring logic for AI quality measurement
- Investigated model failures or unexpected behaviors
- Published technical blog posts, research notes, or engineering write-ups
- Built prototypes quickly to test ideas
- You're an engineer who likes testing systems more than building features
- You enjoy breaking things and understanding why they fail
- You can design experiments that isolate meaningful differences between approaches
- You understand how LLMs, agents, and RAG systems actually work
- You write clearly for technical audiences
- You ship experiments quickly and iterate often
- You care about methodology and reproducibility
- You're curious, creative, and opinionated about how AI should be evaluated