Agent Evaluation Scorecard Prompt
Grade agent output the same way every time — a rubric scoring correctness, grounding, safety, tone, and completeness, with a pass threshold instead of a gut call.
Overview
Evaluating agents by reading a few outputs and nodding doesn't scale and isn't consistent. This prompt builds a scorecard: the dimensions that matter (correctness, groundedness, safety, format adherence, tone, completeness), a concrete scale for each, and a weighted pass threshold — so two reviewers grade the same output the same way and 'good enough to ship' is a number.
Why This Works
- A rubric makes two reviewers grade the same output the same way
- Weighting correctness and safety highest reflects what actually matters
- A capping rule stops a polished but unsafe answer from scoring well
Best for
- Teams evaluating agents by ad-hoc reading
- Eval pipelines needing a structured score
- Agents judged on more than just correctness
Not for
- Generating the test inputs — use the Agent Test Scenario Prompt
- Comparing two versions for drift — use the Agent Regression Test Prompt
Use cases
- Scoring agent outputs consistently across reviewers
- Setting a numeric 'ready to ship' threshold
- Comparing prompt versions on the same rubric