Conversation Quality Evaluation Prompt

Judge the whole conversation, not one reply — evaluate a multi-turn exchange for context retention, coherence, goal completion, and recovery from misunderstanding.

Overview

A chat agent can give good individual replies and still fail the conversation — losing context across turns, contradicting itself, or never actually resolving the user's goal. This prompt evaluates the multi-turn exchange as a whole: does it hold context, stay coherent turn to turn, recover when it misunderstands, and reach the user's goal — the qualities single-reply scoring can't see.

How to use this resource

Assemble the full conversation

Gather a complete multi-turn exchange, not a single reply, along with the user goal. The eval judges the conversation as a whole.

Open this resource in AI Output Validator

Load the prompt into AI Output Validator and paste in the transcript. The tool scores context retention, coherence, recovery, and goal completion without you reading turn by turn.

Review the per-dimension scores

Read how the conversation held context, stayed coherent, recovered from misunderstanding, and reached the goal, noting where it slipped.

Feed the weak turns back

Use the low-scoring dimensions to adjust the agent prompt or memory handling, then re-evaluate a fresh conversation.
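
The last two steps can be sketched in code. This is a minimal Python example, assuming you have already copied the evaluator's six scores into a dict by hand. The `weak_dimensions` helper and the cutoff of 3 are hypothetical choices for illustration; neither the prompt nor AI Output Validator defines a pass threshold.

```python
# Dimension names taken from the EVALUATE section of the prompt template.
DIMENSIONS = (
    "CONTEXT RETENTION",
    "COHERENCE",
    "GOAL COMPLETION",
    "RECOVERY",
    "CLARIFICATION",
    "TONE & CONSISTENCY",
)


def weak_dimensions(scores: dict[str, int], cutoff: int = 3) -> list[str]:
    """Return the dimensions scoring below `cutoff`, lowest first.

    `cutoff` is a hypothetical threshold; the prompt itself defines none.
    """
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    for name in DIMENSIONS:
        if not 0 <= scores[name] <= 5:
            raise ValueError(f"{name} must be 0-5, got {scores[name]}")
    weak = [d for d in DIMENSIONS if scores[d] < cutoff]
    # sorted() is stable, so tied dimensions keep their rubric order.
    return sorted(weak, key=lambda d: scores[d])
```

A conversation that scores well on tone but 1 on RECOVERY and 2 on CONTEXT RETENTION would return those two dimensions, RECOVERY first, which suggests starting with how the agent handles corrections before looking at memory handling.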

Why This Works

Session-level qualities (retention, recovery) are invisible to single-reply scoring
Goal completion measures what users actually care about, not reply polish
Counting turns-to-resolution catches the agent that gets there inefficiently

Best for

Conversational agents and chatbots
Support and assistant agents judged on whole sessions
Teams scoring only single replies and missing session-level failures

Not for

Single-turn output scoring — use the Agent Evaluation Scorecard
Generating conversation test cases — use the Agent Test Scenario Prompt

Use cases

Evaluating a multi-turn chatbot or support agent
Catching context loss and contradictions across turns
Measuring whether conversations actually resolve the goal

FAQ

How do I score a whole chatbot conversation instead of a single reply?

Feed the full transcript plus the GOAL line, and the prompt scores six dimensions 0-5 each, including CONTEXT RETENTION and GOAL COMPLETION, then names the single turn where it most went wrong. AI Output Validator produces the scored analysis; you run its judgment in your own assistant and decide what the numbers mean for the agent.

Does reaching the goal in ten turns instead of three lower the score?

Yes. A RULE states that reaching the goal in ten turns when three would have been enough is a partial failure, and the GOAL COMPLETION dimension weighs turns to resolution. The AGGREGATE section also records whether the user left with their goal met as yes, partially, or no, so inefficiency shows even when the answer eventually lands.

Why isn't this the right prompt for scoring one agent reply?

A RULE tells the evaluator to judge the trajectory, not isolated replies, so a polished reply that ignores earlier turns still fails the CONTEXT RETENTION dimension. Single-turn scoring needs a per-reply scorecard instead; this prompt only makes sense against a full multi-turn transcript paired with a stated user goal.

Prompt Template

Copy it as-is, or use Open in AI Output Validator to load it pre-filled and customize it with your own context.

ROLE
You are evaluating a full multi-turn conversation with an AI agent — the whole exchange, not one reply.

INPUT
GOAL:
[What the user was trying to accomplish]
CONVERSATION:
[The full multi-turn transcript]

EVALUATE (0–5 each, with reason)
1. CONTEXT RETENTION: does it remember and use earlier turns, or forget/re-ask?
2. COHERENCE: is it consistent turn to turn — no contradictions or drift?
3. GOAL COMPLETION: did the conversation actually accomplish the user's goal, and how efficiently (turns to resolution)?
4. RECOVERY: when it misunderstood or the user corrected it, did it recover gracefully?
5. CLARIFICATION: did it ask when genuinely ambiguous, rather than guessing or over-asking?
6. TONE & CONSISTENCY: appropriate and stable persona across the conversation?

AGGREGATE
- Overall conversation quality with the deciding factor.
- The single turn where it most went wrong (if any), and why.
- Did the user leave with their goal met? (yes / partially / no)

RULES
- Judge the trajectory, not isolated replies — a good reply that ignores prior context still fails retention.
- Efficiency counts: reaching the goal in ten turns it could do in three is a partial failure.

OUTPUT
The per-dimension scores with reasons, the overall judgment, the worst turn, and whether the goal was met.
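
If you fill the template programmatically rather than pasting by hand, the two bracketed placeholders in the INPUT section can be substituted directly. The following is a minimal Python sketch under stated assumptions: the `Turn` type, the `render_transcript` and `fill_prompt` helpers, and the "Turn N (speaker):" line format are all hypothetical and not part of the resource. Numbering the turns is a design choice that gives the evaluator something concrete to cite when it names the single turn where the conversation most went wrong.

```python
from dataclasses import dataclass

# Placeholder strings copied verbatim from the INPUT section of the template.
GOAL_SLOT = "[What the user was trying to accomplish]"
TRANSCRIPT_SLOT = "[The full multi-turn transcript]"


@dataclass
class Turn:
    speaker: str  # e.g. "user" or "agent"; these labels are an assumption
    text: str


def render_transcript(turns: list[Turn]) -> str:
    """Number each turn so the evaluator can cite the worst one by index."""
    return "\n".join(
        f"Turn {i} ({t.speaker}): {t.text.strip()}"
        for i, t in enumerate(turns, start=1)
    )


def fill_prompt(template: str, goal: str, turns: list[Turn]) -> str:
    """Substitute the goal and the numbered transcript into the template."""
    if len(turns) < 2:
        raise ValueError("the prompt expects a multi-turn conversation")
    for slot in (GOAL_SLOT, TRANSCRIPT_SLOT):
        if slot not in template:
            raise ValueError(f"template is missing placeholder: {slot}")
    return (
        template
        .replace(GOAL_SLOT, goal.strip())
        .replace(TRANSCRIPT_SLOT, render_transcript(turns))
    )
```

Pass the full template text as `template`; the function leaves the ROLE, EVALUATE, AGGREGATE, RULES, and OUTPUT sections untouched and raises an error if either placeholder is missing or the conversation has fewer than two turns.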
