Agent Tool-Use Evaluation Prompt
Check that the agent calls tools right — the correct tool, valid arguments, the right time, and graceful handling when a tool fails or returns nothing.
Overview
An agent that reasons well but calls tools badly is still broken — it picks the wrong function, passes malformed arguments, calls when it shouldn't, or falls apart when a tool errors. This prompt evaluates tool use specifically: selection, argument correctness, timing, and failure handling, across the scenarios where each goes wrong.
Why This Works
- Tool-use failures break agents that reason perfectly otherwise
- Testing failure handling catches the hallucinated-result-on-error bug
- Checking the no-tool cases catches over-eager calling
Best for
- Agents that call functions, APIs, or tools
- Multi-tool workflows with chained calls
- Pre-production evaluation of agent tool use
Not for
- Agents with no tools
- Evaluating answer quality alone — use the Agent Evaluation Scorecard
Use cases
- Evaluating a tool-using or function-calling agent
- Finding wrong-tool and bad-argument failures
- Testing how the agent handles tool errors