Evaluation Dataset Builder Prompt

Turn real interactions into a labeled eval set — sample for coverage, label each with the expected behavior, and balance the set so the score means something.

Overview

The best evaluation set comes from real usage, but raw logs aren't a dataset — they're unlabeled, unbalanced, and full of duplicates. This prompt builds an eval dataset from real interactions: sampling for coverage across intents and difficulty, labeling each with the expected/correct behavior, deduping near-identical cases, and balancing so common cases don't drown out the rare-but-critical ones.
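As a rough illustration of the deduping step, the sketch below collapses near-identical inputs by normalizing the text and comparing it with Python's standard `difflib.SequenceMatcher`. The function names and the 0.9 similarity threshold are assumptions made for this example; the prompt itself does not prescribe a method or a cutoff.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def dedupe_near_identical(inputs: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first input of each group of near-identical inputs.

    The 0.9 threshold is an assumed starting point, not a recommended value.
    Comparisons are O(n^2), so this suits a sampled batch, not a full log dump.
    """
    kept: list[str] = []
    kept_norm: list[str] = []
    for raw in inputs:
        norm = normalize(raw)
        is_duplicate = any(
            SequenceMatcher(None, norm, seen).ratio() >= threshold
            for seen in kept_norm
        )
        if not is_duplicate:
            kept.append(raw)
            kept_norm.append(norm)
    return kept


logs = [
    "How do I reset my password?",
    "how do i reset my password",
    "How do I reset my password??",
    "Cancel my subscription and refund last month",
]
unique = dedupe_near_identical(logs)
```

String similarity only catches rephrasings that share most of their characters. Two differently worded requests with the same intent would still need the categorization step to group them.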

How to use this resource

1. Assemble the raw interactions
Gather real usage logs across the range of intents and difficulty. The builder turns them into a labeled set, but it needs coverage to sample from.

2. Open this resource in Test Case Prompt Generator
Load the prompt into Test Case Prompt Generator and paste in the interactions. It samples for coverage, labels each with the expected behavior, and dedupes near-identical cases.

3. Review the labeled, balanced set
Check the sampled cases, their expected-behavior labels, and the balance so common cases do not drown out the rare-but-critical ones.

4. Refine the set and re-sample
Adjust the labels or coverage where the set is thin or skewed, then regenerate so the score it produces actually means something.

Why This Works

Coverage-based sampling stops the eval being dominated by the easy majority
Gold labels turn raw logs into a set you can grade against
A held-out split keeps the score honest as you iterate
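One possible way to keep a held-out split stable while the set is regenerated is to assign each case by a hash of its input instead of by random draw. This is a minimal sketch under that assumption; `assign_split` and the 20% hold-out fraction are hypothetical choices for illustration, not something the prompt specifies.

```python
import hashlib


def assign_split(case_input: str, holdout_fraction: float = 0.2) -> str:
    """Return "holdout" or "iteration" from a stable hash of the input text.

    The same input always lands in the same split, so a case never drifts
    from the held-out set into the iteration set between regenerations.
    """
    digest = hashlib.sha256(case_input.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "holdout" if bucket < holdout_fraction else "iteration"


cases = ["reset password", "refund dispute", "export data as CSV"]
splits = {case: assign_split(case) for case in cases}
```

A hash-based split is only approximately 20% on small sets, and it ignores categories. If a rare category must appear in both splits, that has to be checked separately.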

Best for

Agents with real usage to learn from
Teams whose eval set is a handful of hand-picked cases
Building a repeatable evaluation pipeline

Not for

Generating synthetic scenarios from scratch — use the Agent Test Scenario Prompt
Scoring the dataset — use the Agent Evaluation Scorecard

Use cases

Turning production logs into a labeled eval set
Sampling interactions for coverage, not just volume
Building a held-out set for honest evaluation
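To make "coverage, not just volume" concrete, here is a small sketch of sampling a fixed number of cases from every category and difficulty combination. It assumes each interaction has already been tagged with `category` and `difficulty` keys (hypothetical field names), and the `per_cell` quota of 2 is an arbitrary example value.

```python
import random
from collections import defaultdict


def sample_for_coverage(cases: list[dict], per_cell: int = 2, seed: int = 0):
    """Pick up to `per_cell` cases from every (category, difficulty) cell.

    Returns the selected cases and the list of thin cells, meaning cells
    with fewer than `per_cell` examples available in the logs.
    """
    rng = random.Random(seed)  # seeded so a re-run selects the same cases
    cells = defaultdict(list)
    for case in cases:
        cells[(case["category"], case["difficulty"])].append(case)

    selected, thin = [], []
    for cell in sorted(cells):
        members = cells[cell]
        if len(members) < per_cell:
            thin.append(cell)
        selected.extend(rng.sample(members, min(per_cell, len(members))))
    return selected, thin


# Made-up example: 50 easy billing questions and a single refund edge case.
raw = [
    {"input": f"billing question {i}", "category": "billing", "difficulty": "easy"}
    for i in range(50)
]
raw.append({"input": "refund after chargeback", "category": "refund", "difficulty": "edge"})

selected, thin = sample_for_coverage(raw, per_cell=2)
```

The sketch can only report cells that occur at least once in the logs. A category with no examples at all will not show up as thin, so the list of expected categories still needs a human check.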

FAQ

Are the labels in the eval set the agent's current answers or the correct ones?

The gold labels are the expected/correct behavior, not the agent's current output. The template's RULES section states this outright, so you grade the agent against the set rather than against itself. Each selected case gets a label specific enough to grade against, and the output lists each case as input + gold label + category. Test Case Prompt Generator builds the prompt; you paste real logs, run it in your assistant, and review the labels.
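If the labeled cases are stored for later grading, the input + gold label + category shape could be captured as a small record type and written out as JSON Lines. This is a sketch under that assumption: the `EvalCase` class, the field names, the helper functions, and the sample case are all invented for illustration and are not part of the prompt's output format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalCase:
    input: str       # the interaction taken from the logs
    gold_label: str  # expected/correct behavior, not the agent's current output
    category: str    # intent/type, used later for coverage and balance checks


def to_jsonl(cases: list[EvalCase]) -> str:
    """Serialize cases as one JSON object per line."""
    return "\n".join(json.dumps(asdict(case)) for case in cases)


def from_jsonl(text: str) -> list[EvalCase]:
    """Parse JSON Lines back into EvalCase records, skipping blank lines."""
    return [EvalCase(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Made-up example case.
cases = [
    EvalCase(
        input="I was charged twice this month, can you fix it?",
        gold_label="Confirm the duplicate charge, then open a refund request; "
                   "do not promise an amount before verification.",
        category="billing",
    )
]
restored = from_jsonl(to_jsonl(cases))
```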

How does the builder stop common cases from dominating the eval score?

Two of its six steps handle that. SAMPLE FOR COVERAGE selects across every category, with difficulty tiers (easy/typical/hard/edge) assigned in the CATEGORIZE step, and notes which categories are thin. BALANCE flags when common cases vastly outnumber rare-but-critical ones and recommends a target mix. The HOLD-OUT NOTE step then says which cases to reserve as a held-out set, separate from the iteration set. Coverage over volume is the governing rule. The generator produces the prompt; you re-sample where the set is skewed.
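The balance check can also be run by hand on the labeled set. The sketch below counts cases per category and flags rare-but-critical categories that the largest category outnumbers by more than a chosen ratio. The function name, the 5:1 ratio, and the category names are assumptions for illustration; the prompt leaves the target mix to the reviewer.

```python
from collections import Counter


def balance_report(categories: list[str], critical: set[str], max_ratio: float = 5.0):
    """Flag critical categories that are swamped or missing entirely.

    Returns (flagged, missing): `flagged` maps a critical category to how many
    times the largest category outnumbers it, and `missing` lists critical
    categories with no cases at all.
    """
    counts = Counter(categories)
    largest = max(counts.values(), default=0)
    flagged = {
        cat: largest / counts[cat]
        for cat in critical
        if counts[cat] > 0 and largest / counts[cat] > max_ratio
    }
    missing = sorted(cat for cat in critical if counts[cat] == 0)
    return flagged, missing


# Made-up category labels for a small set.
labels = ["faq"] * 40 + ["billing"] * 10 + ["refund_dispute"] * 2
flagged, missing = balance_report(labels, critical={"refund_dispute", "account_takeover"})
```

Here `refund_dispute` is flagged because the largest category has 20 times as many cases, and `account_takeover` is reported as missing.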

Can this generate test scenarios if I don't have real logs yet?

No. It builds a dataset from real interaction logs you paste, sampling and labeling existing usage; it doesn't invent cases from scratch. The "Not for" section routes synthetic scenario generation to the Agent Test Scenario Prompt and scoring the finished set to the Agent Evaluation Scorecard. Without raw interactions there's nothing to sample for coverage. Test Case Prompt Generator produces the labeling prompt; you supply the logs.

Customize This Resource

Opens this setup in Test Case Prompt Generator. Generate to get the full test contract — then adjust the strategy, framework, coverage, and depth.

Open in Test Case Prompt Generator

Prompt Template

Copy it as-is, or use Open in Test Case Prompt Generator to load it pre-filled and customize it with your own context.

ROLE
You are building a labeled evaluation dataset for an AI agent from real interaction logs.

INPUT
RAW INTERACTIONS:
[Paste sample logs / interactions]
AGENT PURPOSE & CORRECT BEHAVIOR:
[What the agent does and what 'correct' means]

BUILD THE DATASET
1. CATEGORIZE: group the interactions by intent/type and by difficulty (easy / typical / hard / edge).
2. SAMPLE FOR COVERAGE: select a set that covers every category — not just the most frequent. Note which categories are thin and need more examples.
3. DEDUPE: collapse near-identical cases so one common phrasing doesn't dominate.
4. LABEL: for each selected case, write the expected/correct behavior (the gold label) — specific enough to grade against.
5. BALANCE: flag if common cases vastly outnumber rare-but-critical ones, and recommend the target mix so the aggregate score isn't dominated by the easy majority.
6. HOLD-OUT NOTE: which cases to reserve as a held-out set vs the iteration set.

RULES
- Coverage over volume: every important category represented beats a big pile of the same case.
- Labels are the expected behavior, not the agent's current output.

OUTPUT
The labeled dataset (input + gold label + category), the coverage gaps, and the recommended balance/hold-out split.
