Agent Regression Test Prompt
Make sure a fix didn't break three other things — compare an agent's outputs across two versions on the same scenarios and flag every behavior that changed for the worse.
Overview
Tuning an agent is whack-a-mole: improving one case can silently regress others, because the change is a prompt edit with no compiler to catch the fallout. This prompt runs a regression check: it compares the old and new versions' outputs on the same scenario set and classifies each change as an improvement, neutral, or a regression, surfacing the behaviors that got worse so a fix doesn't ship a hidden break.
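The classification step can be sketched in Python. This is a minimal illustration under stated assumptions, not the prompt's actual mechanism: it assumes each scenario has already been graded with a numeric score for both versions. The `ScenarioResult` type, the 0-to-1 score scale, and the `tolerance` threshold are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    scenario_id: str
    old_score: float  # hypothetical 0-1 quality score for the old version
    new_score: float  # same scale, for the new version


def classify(result: ScenarioResult, tolerance: float = 0.05) -> str:
    """Label one scenario's change; small deltas count as neutral."""
    delta = result.new_score - result.old_score
    if delta > tolerance:
        return "improvement"
    if delta < -tolerance:
        return "regression"
    return "neutral"


def regression_report(results: list[ScenarioResult]) -> dict[str, list[str]]:
    """Group scenario ids by how they changed between versions."""
    report: dict[str, list[str]] = {
        "improvement": [],
        "neutral": [],
        "regression": [],
    }
    for r in results:
        report[classify(r)].append(r.scenario_id)
    return report


# Illustrative data only: one fixed case, one unchanged, one hidden break.
results = [
    ScenarioResult("refund-request", old_score=0.6, new_score=0.9),
    ScenarioResult("greeting", old_score=0.8, new_score=0.8),
    ScenarioResult("password-reset", old_score=0.9, new_score=0.4),
]
report = regression_report(results)
print(report["regression"])
```

The tolerance band is a design choice: agent outputs vary from run to run, so treating every small score difference as a change would bury real regressions in noise.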
Why This Works
- A version diff on the same set catches the silent regression a spot-check misses
- Classifying each change makes the trade-off of a fix explicit
- A critical-regression veto blocks any change that breaks a critical behavior, so a net-negative fix doesn't ship
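One way to picture the veto is as a gate applied after every scenario has been labeled. The sketch below is an assumption about how such a gate might work, not the prompt's defined behavior: the `ship_decision` function, the set of critical scenario ids, and the count-based "net negative" rule are all hypothetical.

```python
def ship_decision(changes: dict[str, str], critical_ids: set[str]) -> dict:
    """Decide whether a change may ship.

    changes maps scenario id -> "improvement", "neutral", or "regression".
    critical_ids names the scenarios that must never regress (hypothetical).
    """
    regressions = sorted(
        sid for sid, label in changes.items() if label == "regression"
    )
    critical_hits = [sid for sid in regressions if sid in critical_ids]

    # Veto: a single critical regression blocks the change,
    # no matter how many other scenarios improved.
    if critical_hits:
        return {
            "ship": False,
            "reason": "critical regression",
            "scenarios": critical_hits,
        }

    # Assumed rule: more regressions than improvements is net negative.
    improvements = sum(1 for label in changes.values() if label == "improvement")
    if len(regressions) > improvements:
        return {"ship": False, "reason": "net negative", "scenarios": regressions}

    return {
        "ship": True,
        "reason": "no blocking regressions",
        "scenarios": regressions,
    }


# Illustrative labels only.
changes = {
    "refund-request": "improvement",
    "greeting": "neutral",
    "password-reset": "regression",
}
decision = ship_decision(changes, critical_ids={"password-reset"})
print(decision["ship"], decision["reason"])
```

Checking the veto before the net count keeps the trade-off explicit: one improvement cannot buy back a break in a behavior the team has marked as critical.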
Best for
- Iterating on an agent's prompt or model
- Any agent change heading to production
- Teams that fix one case and break another
Not for
- Building the scenarios — use the Agent Test Scenario Prompt
- Scoring a single version — use the Agent Evaluation Scorecard
Use cases
- Checking a prompt change for hidden regressions
- Comparing two agent versions before shipping
- Documenting what a change improved and what it broke