Agent Safety & Refusal Evaluation Prompt
Test both failure directions: does the agent refuse what it must, and does it stay helpful on the benign requests it should not refuse?
Overview
Agent safety has two failure modes, and most evals test only one. This prompt evaluates both: does the agent refuse the genuinely out-of-bounds requests (harmful, policy-violating, out-of-scope, or prompt-injection attempts), and does it avoid refusing the benign ones that merely pattern-match as risky? It scores refusal correctness in both directions, because an agent that refuses everything is as broken as one that refuses nothing.
Why This Works
- Testing both directions catches the over-refusal most safety evals ignore
- Separating false-refusals from missed-refusals shows which way to tune
- Judging refusal quality catches the blunt, wall-like refusals that frustrate users
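
As a rough illustration of the two-direction scoring described above, the sketch below computes a missed-refusal rate and a false-refusal rate from labeled test cases. The names `Case`, `refusal_rates`, `should_refuse`, and `did_refuse` are hypothetical and not part of this prompt. The sketch also assumes each case already carries a ground-truth label and an observed outcome. It is one possible way to tally results, not a prescribed implementation.

```python
from dataclasses import dataclass


@dataclass
class Case:
    # Hypothetical record for one test case (names are illustrative).
    should_refuse: bool  # ground-truth label: is the request out of bounds?
    did_refuse: bool     # observed behavior: did the agent refuse?


def refusal_rates(cases):
    """Return the error rate for each failure direction separately."""
    must_refuse = [c for c in cases if c.should_refuse]
    benign = [c for c in cases if not c.should_refuse]

    # Missed refusal: the agent helped with a request it had to refuse.
    missed = sum(1 for c in must_refuse if not c.did_refuse)
    # False refusal: the agent refused a benign request (over-refusal).
    false_refusals = sum(1 for c in benign if c.did_refuse)

    return {
        "missed_refusal_rate": missed / len(must_refuse) if must_refuse else 0.0,
        "false_refusal_rate": false_refusals / len(benign) if benign else 0.0,
    }


# Made-up example data: 2 out-of-bounds cases, 4 benign cases.
cases = [
    Case(should_refuse=True, did_refuse=True),
    Case(should_refuse=True, did_refuse=False),   # missed refusal
    Case(should_refuse=False, did_refuse=False),
    Case(should_refuse=False, did_refuse=True),   # false refusal
    Case(should_refuse=False, did_refuse=False),
    Case(should_refuse=False, did_refuse=False),
]
rates = refusal_rates(cases)

# An agent that refuses everything scores perfectly in one direction
# and fails completely in the other.
refuse_all = [Case(c.should_refuse, True) for c in cases]
refuse_all_rates = refusal_rates(refuse_all)
```

Keeping the two rates separate, rather than folding them into one accuracy number, is what shows which way to tune: a high missed-refusal rate suggests tightening, and a high false-refusal rate suggests loosening.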
Best for
- User-facing agents with safety boundaries
- Agents exposed to adversarial input
- Teams tuning the refuse/help balance
Not for
- Pure correctness evaluation — use the Agent Evaluation Scorecard
- Generating the adversarial cases — use the Agent Test Scenario Prompt
Use cases
- Testing agent refusals on harmful and injection inputs
- Catching over-refusal that hurts benign users
- Evaluating refusal quality, not just refusal presence