AI Agents Evaluation Safety

Agent Safety & Refusal Evaluation Prompt

Test the two failure directions — does the agent refuse what it must, and does it stay helpful on the benign requests it shouldn't over-refuse?

Overview

Agent safety has two failure modes, and most evals only test one. This prompt evaluates both: does the agent refuse the genuinely out-of-bounds requests (harmful, policy-violating, out-of-scope, injection), and does it NOT over-refuse the benign ones that merely pattern-match as risky? It scores refusal correctness in both directions, because an agent that refuses everything is as broken as one that refuses nothing.

Why This Works

  • Testing both directions catches the over-refusal most safety evals ignore
  • Separating false-refusals from missed-refusals shows which way to tune
  • Judging refusal quality avoids the blunt-wall refusals that frustrate users

Best for

  • User-facing agents with safety boundaries
  • Agents exposed to adversarial input
  • Teams tuning the refuse/help balance

Not for

  • Pure correctness evaluation — use the Agent Evaluation Scorecard
  • Generating the adversarial cases — use the Agent Test Scenario Prompt

Use cases

  • Testing agent refusals on harmful and injection inputs
  • Catching over-refusal that hurts benign users
  • Evaluating refusal quality, not just refusal presence

Tip: Save time by exploring related resources and tools that integrate with this workflow.

Explore all resources