Conversation Quality Evaluation Prompt

Judge the whole conversation, not one reply — evaluate a multi-turn exchange for context retention, coherence, goal completion, and recovery from misunderstanding.

Overview

A chat agent can give good individual replies and still fail the conversation: it can lose context across turns, contradict itself, or never resolve the user's goal. This prompt evaluates the multi-turn exchange as a whole. It asks whether the agent holds context, stays coherent from turn to turn, recovers when it misunderstands, and reaches the user's goal, which are the qualities single-reply scoring cannot see.
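As a rough illustration of what judging the whole exchange can look like in practice, the sketch below flattens a multi-turn transcript into a single judge input that names the four qualities described above. The `Turn` type, the `DIMENSIONS` list, and `build_judge_input` are hypothetical names chosen for this example; they are not the wording or structure of the prompt itself.

```python
from dataclasses import dataclass

# Hypothetical names for illustration only. The four dimensions are the
# ones named in the overview; the exact wording is an assumption.
DIMENSIONS = [
    "context retention",
    "coherence",
    "recovery from misunderstanding",
    "goal completion",
]


@dataclass
class Turn:
    role: str  # "user" or "agent"
    text: str


def build_judge_input(turns: list[Turn], user_goal: str) -> str:
    """Render the full conversation so the judge sees every turn at once."""
    lines = [f"User goal: {user_goal}", "", "Transcript:"]
    # Number the turns so the judge can point at where context was lost.
    for i, turn in enumerate(turns, start=1):
        lines.append(f"[{i}] {turn.role}: {turn.text}")
    lines.append("")
    lines.append(
        "Rate the conversation as a whole on: " + ", ".join(DIMENSIONS) + "."
    )
    return "\n".join(lines)
```

Numbering the turns is a design choice for this sketch: it lets a judge cite the specific turn where a contradiction or context loss occurred rather than giving only a session-wide verdict.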

Why This Works

  • Session-level qualities (retention, recovery) are invisible to single-reply scoring
  • Goal completion measures what users actually care about, not reply polish
  • Counting turns-to-resolution catches the agent that reaches the goal but takes too many turns to get there
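The session-level measures in the list above could be recorded along these lines. This is a minimal sketch under stated assumptions: the `SessionScore` fields, the 1-to-5 scale, and the `expected_turns` threshold are all hypothetical and would need to match whatever the evaluation prompt actually returns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionScore:
    # Assumed 1-5 ratings for qualities that only show up across turns.
    context_retention: int
    recovery: int
    # Whether the user's goal was actually resolved in this session.
    goal_completed: bool
    # 1-based turn at which the goal was met; None if it never was.
    resolved_at_turn: Optional[int] = None


def turns_to_resolution(score: SessionScore) -> Optional[int]:
    """Return the turn count to resolution, or None for unresolved sessions."""
    return score.resolved_at_turn if score.goal_completed else None


def is_inefficient(score: SessionScore, expected_turns: int) -> bool:
    """Flag sessions that resolve the goal but take more turns than expected."""
    n = turns_to_resolution(score)
    return n is not None and n > expected_turns
```

Keeping `goal_completed` separate from the turn count means an unresolved session is reported as a failure to complete the goal, not as a merely slow one.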

Best for

  • Conversational agents and chatbots
  • Support and assistant agents judged on whole sessions
  • Teams scoring only single replies and missing session-level failures

Not for

  • Single-turn output scoring — use the Agent Evaluation Scorecard
  • Generating conversation test cases — use the Agent Test Scenario Prompt

Use cases

  • Evaluating a multi-turn chatbot or support agent
  • Catching context loss and contradictions across turns
  • Measuring whether conversations actually resolve the goal
