Evaluation Dataset Builder Prompt
Turn real interactions into a labeled eval set — sample for coverage, label each with the expected behavior, and balance the set so the score means something.
Overview
The best evaluation set comes from real usage, but raw logs aren't a dataset: they're unlabeled, unbalanced, and full of duplicates. This prompt builds an eval dataset from real interactions by sampling for coverage across intents and difficulty levels, labeling each case with the expected behavior, removing near-duplicate cases, and balancing the set so common cases don't drown out the rare-but-critical ones.
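To make the sampling, deduping, and balancing steps concrete, here is a minimal Python sketch. It assumes each log row already carries hypothetical `input`, `intent`, and `difficulty` fields (in practice those tags might come from a classifier or a first labeling pass), and it treats two inputs as near-identical when they match after lowercasing and stripping punctuation. That is a deliberately crude stand-in for whatever similarity measure fits your data.

```python
import random
import re
from collections import defaultdict


def normalize(text):
    """Lowercase and drop punctuation so trivially different inputs share one key."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())


def build_eval_set(logs, per_stratum=2, seed=0):
    """Dedupe on normalized input, then sample up to per_stratum cases
    from every (intent, difficulty) stratum so rare strata are represented."""
    seen = set()
    strata = defaultdict(list)
    for row in logs:
        key = normalize(row["input"])
        if key in seen:
            continue  # near-identical to a case we already kept
        seen.add(key)
        strata[(row["intent"], row["difficulty"])].append(row)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for stratum in sorted(strata):
        rows = strata[stratum]
        for row in rng.sample(rows, min(per_stratum, len(rows))):
            # "expected" is left empty here; a labeler fills in the gold behavior.
            sample.append({**row, "expected": None})
    return sample


# Hypothetical log rows, for illustration only.
logs = [
    {"input": "Reset my password", "intent": "account", "difficulty": "easy"},
    {"input": "reset my password!", "intent": "account", "difficulty": "easy"},
    {"input": "How do I reset my password?", "intent": "account", "difficulty": "easy"},
    {"input": "Change my email address", "intent": "account", "difficulty": "easy"},
    {"input": "Delete all my data under GDPR", "intent": "privacy", "difficulty": "hard"},
]

eval_set = build_eval_set(logs)
```

Capping each stratum is what keeps the single rare privacy case from being outvoted by the many easy account requests; the cap and the stratum keys are choices to tune, not fixed parts of the method.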
Why This Works
- Coverage-based sampling keeps the easy majority from dominating the eval
- Gold labels turn raw logs into a set you can grade against
- A held-out split keeps the score honest as you iterate
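One way to keep a held-out split stable while the dataset grows is to assign each case to a split from a hash of its ID, rather than reshuffling on every rebuild. The sketch below assumes every case has a stable string ID; the `split_of` name and the 20% default are illustrative choices, not part of the prompt.

```python
import hashlib


def split_of(case_id, holdout_fraction=0.2):
    """Map a case ID to "heldout" or "dev" deterministically.

    The same ID always lands in the same split, so cases never leak
    from the held-out set into the set you iterate against.
    """
    digest = hashlib.sha256(case_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "heldout" if bucket < holdout_fraction else "dev"


# Hypothetical case IDs, for illustration only.
case_ids = [f"case-{i}" for i in range(1000)]
splits = {cid: split_of(cid) for cid in case_ids}
heldout_share = sum(1 for s in splits.values() if s == "heldout") / len(splits)
```

Grade prompt or agent changes against the `dev` cases while iterating, and score the `heldout` cases only when you want a number you have not tuned toward.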
Best for
- Agents with real usage to learn from
- Teams whose eval set is a handful of hand-picked cases
- Building a repeatable evaluation pipeline
Not for
- Generating synthetic scenarios from scratch — use the Agent Test Scenario Prompt
- Scoring the dataset — use the Agent Evaluation Scorecard
Use cases
- Turning production logs into a labeled eval set
- Sampling interactions for coverage, not just volume
- Building a held-out set for honest evaluation