Agent Tool-Use Evaluation Prompt

Check that the agent calls tools correctly: the right tool, valid arguments, the right timing, and graceful handling when a tool fails or returns nothing.

Overview

An agent that reasons well but calls tools badly is still broken: it picks the wrong function, passes malformed arguments, calls a tool when it shouldn't, or falls apart when a tool errors. This prompt evaluates tool use specifically (selection, argument correctness, timing, failure handling, no-tool cases, and chaining) across the scenarios where each goes wrong.

How to use this resource

1. Assemble the tool-call traces: gather the agent runs that exercise its tools, including the cases where a tool fails or returns nothing. The eval checks six dimensions: selection, arguments, timing, failure handling, no-tool cases, and chaining.

2. Open this resource in Test Case Prompt Generator: load the prompt into Test Case Prompt Generator and fill in the agent's tools. It builds the scenarios where each kind of tool-use mistake shows up.

3. Review the tool-use findings: read how the agent scored on picking the right tool, passing valid arguments, calling at the right time, and handling errors gracefully.

4. Fix the tool logic and re-run: use the failures to adjust the agent's tool instructions or guardrails, then re-test the scenarios where it slipped.
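
For the first step, it can help to record each run in a fixed shape so you can confirm that at least one trace includes a failing or empty tool result before you evaluate. The sketch below is only an illustration: `ToolCall`, `AgentTrace`, and the `get_weather` tool are hypothetical names, not something Test Case Prompt Generator defines.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    """One tool invocation recorded from an agent run (hypothetical shape)."""
    tool: str
    arguments: dict[str, Any]
    result: Optional[Any] = None  # None means the tool returned nothing
    error: Optional[str] = None   # set when the tool raised or timed out


@dataclass
class AgentTrace:
    """A task plus the ordered tool calls and the agent's final answer."""
    task: str
    calls: list[ToolCall] = field(default_factory=list)
    final_answer: str = ""

    def exercises_failure(self) -> bool:
        """True if any call errored or came back empty."""
        return any(
            c.error is not None or c.result in (None, [], {}, "")
            for c in self.calls
        )


# Two example runs: one clean, one where the (hypothetical) tool times out.
traces = [
    AgentTrace(
        task="What is the weather in Oslo?",
        calls=[ToolCall("get_weather", {"city": "Oslo"}, result={"temp_c": 4})],
        final_answer="It is 4 degrees Celsius in Oslo.",
    ),
    AgentTrace(
        task="What is the weather in Oslo?",
        calls=[ToolCall("get_weather", {"city": "Oslo"}, error="timeout")],
        final_answer="I could not reach the weather service.",
    ),
]

# The eval needs at least one failure case to judge failure handling.
has_failure_case = any(t.exercises_failure() for t in traces)
```

If `has_failure_case` is false, the trace set cannot say anything about failure handling, so add a run where a tool errors or returns nothing.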

Why This Works

Tool-use failures break agents that reason perfectly otherwise
Testing failure handling catches the hallucinated-result-on-error bug
Checking the no-tool cases catches over-eager calling

Best for

Agents that call functions, APIs, or tools
Multi-tool workflows with chained calls
Pre-production evaluation of agent tool use

Not for

Agents with no tools
Evaluating answer quality alone — use the Agent Evaluation Scorecard

Use cases

Evaluating a tool-using or function-calling agent
Finding wrong-tool and bad-argument failures
Testing how the agent handles tool errors

FAQ

What inputs do I need to evaluate an agent's tool calls with this prompt?

Two things: the TOOLS AVAILABLE block — each tool's purpose and argument schema — and a SCENARIO + AGENT TRACE showing the task and the agent's actual tool calls and reasoning. Include runs where a tool errors or returns nothing, since FAILURE HANDLING is one of the six dimensions. The Test Case Prompt Generator builds the eval; you run it in your assistant and own the verdict.
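
As a rough sketch of how those two inputs could be put together, the function below renders a TOOLS AVAILABLE block and a SCENARIO + AGENT TRACE block as plain text. The dictionary keys (`name`, `purpose`, `schema`, `tool`, `arguments`, `result`, `error`) and the `render_input` helper are assumptions made for this example; the prompt itself only asks for the information, not a particular layout.

```python
import json


def render_input(tools: list[dict], task: str, trace: list[dict]) -> str:
    """Fill the prompt's INPUT section from tool specs and a recorded trace."""
    tool_lines = []
    for t in tools:
        schema = json.dumps(t["schema"], sort_keys=True)
        tool_lines.append(f"- {t['name']}: {t['purpose']} | args schema: {schema}")

    trace_lines = [f"Task: {task}"]
    for i, step in enumerate(trace, start=1):
        # Show errors explicitly so the evaluator can judge failure handling.
        if step.get("error"):
            outcome = f"ERROR: {step['error']}"
        else:
            outcome = f"result: {json.dumps(step.get('result'))}"
        call = f"{step['tool']}({json.dumps(step['arguments'])})"
        trace_lines.append(f"{i}. {call} -> {outcome}")

    return (
        "TOOLS AVAILABLE:\n" + "\n".join(tool_lines)
        + "\nSCENARIO + AGENT TRACE:\n" + "\n".join(trace_lines)
    )


# Hypothetical tool and trace, including a run where the tool times out.
tools = [{
    "name": "get_weather",
    "purpose": "Look up current weather for a city",
    "schema": {"required": ["city"], "properties": {"city": {"type": "string"}}},
}]
trace = [{"tool": "get_weather", "arguments": {"city": "Oslo"}, "error": "timeout"}]

prompt_input = render_input(tools, "What is the weather in Oslo?", trace)
```

Paste the rendered text into the INPUT section of the prompt template in place of the bracketed placeholders.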

How does this scoring treat an agent that makes up a result when a tool fails?

The rules mark that as the most serious case: "Hallucinating a tool result instead of handling the failure is CRITICAL," while a malformed-but-recovered call is only MINOR and a wrong-tool choice is MAJOR/CRITICAL. Each of the six dimensions gets a PASS/FAIL with the specific issue. These are the prompt's own severity labels, not a certified audit; you interpret the findings and decide what to fix.
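
If you collect the verdicts in code, a small structure like the one below can flag missing dimensions and surface the most severe failure. This is a sketch under assumptions: the `Finding` record is hypothetical, and "the worst failed severity drives the overall result" is a convention chosen for this example, since the prompt does not define how to combine the six verdicts.

```python
from dataclasses import dataclass
from typing import Optional

# Severity labels and dimension names as used in the prompt template.
SEVERITY_RANK = {"MINOR": 1, "MAJOR": 2, "CRITICAL": 3}
DIMENSIONS = (
    "SELECTION", "ARGUMENTS", "TIMING",
    "FAILURE HANDLING", "NO-TOOL CASES", "CHAINING",
)


@dataclass
class Finding:
    """One per-dimension verdict copied from the eval output (hypothetical shape)."""
    dimension: str
    passed: bool
    severity: Optional[str] = None  # only meaningful when passed is False
    issue: str = ""


def missing_dimensions(findings: list[Finding]) -> list[str]:
    """Dimensions the eval output did not report on."""
    reported = {f.dimension for f in findings}
    return [d for d in DIMENSIONS if d not in reported]


def worst_severity(findings: list[Finding]) -> Optional[str]:
    """Highest severity among failed dimensions, or None if nothing failed."""
    failed = [
        f.severity for f in findings
        if not f.passed and f.severity in SEVERITY_RANK
    ]
    return max(failed, key=SEVERITY_RANK.__getitem__) if failed else None


findings = [
    Finding("SELECTION", True),
    Finding("ARGUMENTS", False, "MINOR", "malformed date, retried correctly"),
    Finding("FAILURE HANDLING", False, "CRITICAL", "invented a result after a timeout"),
]
```

You still read the individual issues yourself; the helper only keeps a CRITICAL finding from being averaged away by passes elsewhere.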

Customize This Resource

Opens this setup in Test Case Prompt Generator. Generate to get the full test contract — then adjust the strategy, framework, coverage, and depth.

Prompt Template

Copy it as-is, or use Open in Test Case Prompt Generator to load it pre-filled and customize it with your own context.

ROLE
You are evaluating how well an AI agent uses its tools (function calls), not just how it reasons.

INPUT
TOOLS AVAILABLE:
[The tools/functions, their purpose, and their argument schemas]
SCENARIO + AGENT TRACE:
[The task and the agent's tool calls / reasoning]

EVALUATE
1. SELECTION: did it choose the right tool for the task — not a wrong-but-plausible one, and not skipping a needed tool?
2. ARGUMENTS: are the arguments valid against the schema, correctly typed, and semantically right (not just well-formed)?
3. TIMING: did it call at the right point — not prematurely (before it had the inputs) or redundantly (re-calling needlessly)?
4. FAILURE HANDLING: when a tool errors, returns empty, or times out, does the agent recover gracefully or break / hallucinate a result?
5. NO-TOOL CASES: did it correctly NOT call a tool when it should answer directly?
6. CHAINING: for multi-tool tasks, is the sequence and data hand-off correct?

FOR EACH, report PASS/FAIL with the specific issue.

RULES
- A malformed-but-recovered call is MINOR; a wrong tool or an unhandled failure is MAJOR/CRITICAL.
- Hallucinating a tool result instead of handling the failure is CRITICAL.

OUTPUT
The per-dimension verdicts with issues, and the overall tool-use assessment.
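
Two of the dimensions above, ARGUMENTS and TIMING, have a mechanical part that can be pre-checked in code before the trace goes to the evaluator. The sketch below is not part of the prompt: `check_arguments` and `redundant_calls` are hypothetical helpers, the schema format is a small JSON Schema-like subset, and unknown arguments are treated as problems, which is stricter than JSON Schema's default. It cannot judge whether arguments are semantically right, and an identical repeat is only a candidate for a redundant call; both still need the evaluator.

```python
import json
from typing import Any

# Map schema type names to Python types (subset, assumed for this sketch).
_TYPES = {
    "string": str, "integer": int, "number": (int, float),
    "boolean": bool, "array": list, "object": dict,
}


def check_arguments(schema: dict, args: dict[str, Any]) -> list[str]:
    """Return well-formedness problems for one call's arguments."""
    problems = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required argument: {name}")
    for name, value in args.items():
        if name not in props:
            problems.append(f"unexpected argument: {name}")
            continue
        expected = props[name].get("type")
        py_type = _TYPES.get(expected)
        if py_type is None:
            continue  # unknown or unspecified type: nothing to check
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and expected in ("integer", "number")
        if bool_as_number or not isinstance(value, py_type):
            problems.append(f"{name}: expected {expected}, got {type(value).__name__}")
    return problems


def redundant_calls(calls: list[tuple[str, dict]]) -> list[int]:
    """Indexes of calls that repeat an earlier call with identical arguments."""
    seen = set()
    repeats = []
    for i, (tool, args) in enumerate(calls):
        key = (tool, json.dumps(args, sort_keys=True))
        if key in seen:
            repeats.append(i)
        seen.add(key)
    return repeats


# Hypothetical schema for a weather tool.
schema = {
    "required": ["city"],
    "properties": {"city": {"type": "string"}, "days": {"type": "integer"}},
}
problems = check_arguments(schema, {"days": "3"})
```

Here `problems` lists the missing `city` argument and the string passed for `days`, which you can attach to the trace as notes for the ARGUMENTS verdict.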
