"Write tests for this code" gets happy-path tests with weak assertions. Pick the test strategy, the framework, and the coverage areas — and get a test generation contract: failure scenarios, edge case groups, framework discipline, and non-goals that keep the AI from rewriting your code. Runs entirely in your browser.
What is being tested, and why now? E.g. "Generate tests for the orders REST endpoint before it goes public."
Each area contributes real testing instructions to the contract — not just a heading.
Paste the code under test and the prompt carries it; leave empty and the prompt ends with a paste-here placeholder.
[Resource] Test the two failure directions: does the agent refuse what it must, and does it stay helpful on the benign requests it shouldn't over-refuse?
[Resource] Build the test set an agent has to pass: scenarios across the happy path, edges, and adversarial inputs, each paired with the expected behavior to grade against.
[Resource] Status codes, response shapes, 401 vs 403, idempotency: API tests that test the contract, not the implementation.
[Resource] Login, token refresh, and everything that must fail (expired tokens, wrong permissions, malicious credentials): auth tested as behavior.
[Resource] Cart to confirmation as a user would do it, with the failure scenarios real users actually trigger: refreshes, back buttons, double clicks.
[Resource] Null, empty, min, max, off-by-one, malformed, unicode: the systematic boundary hunt that finds bugs where they actually live.
[Resource] Turn real interactions into a labeled eval set: sample for coverage, label each with the expected behavior, and balance the set so the score means something.
[Resource] Real database, real services, real transactions: integration tests that verify round-trips and rollback, not mocked theater.
[Resource] getByRole over CSS chains, auto-wait over sleep, web-first assertions: Playwright tests written the way Playwright wants.
[Resource] Characterization tests for legacy code: assert what it does TODAY, bugs and all, so tomorrow's change can't lie about its impact.
[Resource] Explicit waits, stale-element handling, drivers that actually quit: Selenium regression tests that hold the legacy fort.
[Resource] Check that the agent calls tools right: the correct tool, valid arguments, the right time, and graceful handling when a tool fails or returns nothing.
[Resource] Mock the dependencies, test the business logic, one behavior per test: the unit testing contract that bans plumbing tests.
[Resource] Required fields one at a time, invalid formats, business rules at their exact boundaries: validation tested the way users break it.
[Playbook] A complete AI-assisted review pass, not one prompt, that ends with ranked findings, tests guarding behavior, and a refactor plan when one is warranted.
[Playbook] Review code for what an attacker would do, not just what tests catch: anchor the model as a security engineer, run a threat-focused review, then back the findings with auth and input tests.
[Playbook] The order that actually finds bugs instead of guessing at them, so you end with a verified fix, not a plausible one that quietly returns next week.
[Playbook] Update old, risky code you didn't write, safely, by understanding and pinning its behavior in tests before you change a single line.
[Playbook] Restructure code you own without breaking it: change only what's worth changing, and prove with tests and a diff that behavior held.
[Playbook] Build a test suite that fails for real reasons, not green decoration: coverage across unit, integration, and edge cases, then a review for the gaps.
[Playbook] Speed up code that works but drags: find the actual hot path instead of guessing, understand why it's slow, optimize it, and prove with tests that you changed the speed and nothing else.
[Playbook] Find out whether an AI agent behaves before users do: define what correct means, build test scenarios with expected outputs, catch failures and hallucinations, then regression-test each version.
[Blueprint] The full path from idea to a shipped SaaS MVP: define and scope the requirements, design the architecture, API, and data model, then build it reviewed, tested, secured, cost-controlled, and deployed.
[Blueprint] The full path to a support agent you can put in front of customers: write its instructions, ground it in your docs, route and handle tickets, then evaluate and cost-control it before it goes live.
[Blueprint] The full path to a backend you can put clients on: define the requirements, design the architecture, API contract, data model, and access control, then build it reviewed, tested, secured, and shipped.
[Blueprint] The full path to taming an inherited codebase: understand it, document its architecture, pin its behavior with tests, then refactor, modernize, review, speed up, and ship it without breaking what works.
[Blueprint] The full path to a retrieval system that returns grounded answers: understand the corpus, chunk and ground it, extract and classify the metadata, then evaluate that retrieval actually works.
[Blueprint] The full path to automation that survives the real world: wire the integrations and triggers, design the control API, move the data through validated stages, evaluate the AI steps, then deploy.
[Blueprint] The full path to a support operation, not just a bot: stand up the knowledge base, route the tickets, add the AI agent, integrate your stack, close the feedback loop, evaluate, and deploy.
[Blueprint] The full path to a two-sided platform: define the buyer-and-seller requirements, model the data, design the API, build roles and permissions, wire integrations, design the UI, then test, secure, and ship it.
[Blueprint] The full path to a store you own end to end: model the catalog and orders, design the storefront and checkout, add customer accounts and payments, then secure it, test it, and ship.
1. State the testing objective.
2. Pick the test strategy: Unit, Integration, End-to-End, Regression, Edge Case, or API. Each strategy is a different testing philosophy with its own principles and failure scenarios: unit tests isolate and mock, integration tests make the boundary the subject, regression tests pin today's behavior by name.
3. Choose the framework as a mode: xUnit theories, Jest mock discipline, Playwright's no-sleep locator rules, or PyTest fixtures.
4. Toggle the coverage areas that apply: happy path, edge cases, error handling, validation, security, performance, regression. Each area contributes real testing instructions, and the live Coverage Preview shows exactly what your contract will demand.
5. Set the depth. Production Ready means CI discipline and a regression net, not just more tests.
6. Optionally paste the code, then click Generate Test Prompt.

The output is a test generation contract, including non-goals that stop the AI from rewriting your implementation. Nothing leaves your browser.
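To make the shape of such a contract concrete, here is a minimal Python sketch of how the selections could be assembled into one prompt. It is not the tool's actual implementation: `ContractRequest`, `build_test_contract`, the section labels, and the per-area instruction wording are all hypothetical. Only the non-goals and the paste-here fallback follow what this page describes.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical instruction text per coverage area (illustrative wording only).
COVERAGE_INSTRUCTIONS = {
    "happy path": "Cover the main success flow with specific assertions on outputs.",
    "edge cases": "Test null, empty, min, max, and off-by-one inputs.",
    "error handling": "Trigger each failure path and assert the exact error returned.",
}

# Non-goals as described on this page.
NON_GOALS = [
    "Do not rewrite or 'improve' the implementation.",
    "Do not refactor for testability; flag it instead.",
    "Do not invent requirements.",
    "Do not weaken assertions to make tests pass.",
]


@dataclass
class ContractRequest:
    """Hypothetical container for the choices made in the form."""
    objective: str
    strategy: str
    framework: str
    coverage: List[str] = field(default_factory=list)
    code: str = ""


def build_test_contract(req: ContractRequest) -> str:
    """Assemble the selections into a single prompt string."""
    lines = [
        f"OBJECTIVE: {req.objective}",
        f"STRATEGY: {req.strategy}",
        f"FRAMEWORK: {req.framework}",
        "COVERAGE:",
    ]
    # Each selected area contributes an instruction, not just a heading.
    # An area missing from the table raises KeyError in this sketch.
    for area in req.coverage:
        lines.append(f"- {area}: {COVERAGE_INSTRUCTIONS[area]}")
    lines.append("NON-GOALS:")
    lines.extend(f"- {goal}" for goal in NON_GOALS)
    lines.append("CODE UNDER TEST:")
    # With no code pasted, the prompt ends with a placeholder.
    lines.append(req.code if req.code.strip() else "[paste code here]")
    return "\n".join(lines)
```

Calling `build_test_contract(ContractRequest("Generate tests for the orders REST endpoint before it goes public.", "API", "PyTest", ["happy path", "error handling"]))` would yield a prompt that lists only the two chosen coverage areas and ends with the placeholder line.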
Different questions. Code Review asks "what is wrong with this code?" — it judges and reports findings. Test Case asks "how should this code be tested?" — it creates validation instructions. Review identifies the gaps; this tool builds the net. They meet in the middle: a review finding of "tests missing for the error path" is exactly what this tool's error-handling coverage area generates.
Because a framework is a serializer of testing intent, not a testing philosophy. The strategy (E2E, regression) decides WHAT to test; Selenium vs Playwright vs Cypress only changes HOW the instructions are phrased — waits, locators, lifecycle. One tool with framework modes beats four tools that are 90% identical — the same reason the JSON Output Prompt Builder absorbed XML and YAML.
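One way to picture "framework as a mode" is a lookup table that changes only the phrasing while the test intent stays fixed. The sketch below is an assumption about how that could look, not the tool's code: `FRAMEWORK_PHRASING` and `render_instructions` are hypothetical names, and the rule text is paraphrased from the Playwright and Selenium descriptions on this page.

```python
# Hypothetical phrasing table: the strategy decides WHAT to test,
# the framework entry only changes HOW the instructions are worded.
FRAMEWORK_PHRASING = {
    "Playwright": {
        "waits": "Rely on auto-waiting and web-first assertions; never sleep.",
        "locators": "Prefer getByRole over CSS selector chains.",
    },
    "Selenium": {
        "waits": "Use explicit waits and handle stale elements.",
        "lifecycle": "Make sure every driver actually quits.",
    },
}


def render_instructions(intent: str, framework: str) -> str:
    """Render the same test intent with framework-specific rules."""
    rules = FRAMEWORK_PHRASING[framework]
    body = "\n".join(f"- {topic}: {rule}" for topic, rule in rules.items())
    return f"TEST INTENT: {intent}\nFRAMEWORK RULES ({framework}):\n{body}"
```

Rendering the same intent for both frameworks gives an identical first line and different rule lines, which is the point of keeping one tool with several modes.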
Each strategy swaps the philosophy block and the failure scenarios. Unit tests get isolation rules and mocked-dependency failures; integration tests get transaction rollback and partial-failure scenarios; E2E gets session expiry and double-submission; API tests get the 401-vs-403 distinction and idempotency. Same skeleton, different testing worldview.
The NON-GOALS section is explicit: do not rewrite or "improve" the implementation, do not refactor for testability (flag it instead), do not invent requirements, and do not weaken assertions to make tests pass. The ASSUMPTIONS section forces the model to separate what it knows from what it guessed — and to list the GAPS it couldn't test.
Debug first — that's the Debugging Prompt Generator's job: "why is this failing?". This tool asks "how should behavior be validated?" and assumes the code's intended behavior is known. The one exception is regression strategy on legacy code: characterization tests deliberately pin current behavior, bugs and all, before you change anything.
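A characterization test can be sketched as follows. The function `legacy_discount` and its quirk are invented for illustration; the point is only that the tests assert what the code does today, including behavior that looks wrong.

```python
def legacy_discount(total: float, coupon: str) -> float:
    """Hypothetical legacy function with a quirk: the coupon check is case-sensitive."""
    if coupon == "SAVE10":
        return round(total * 0.9, 2)
    return total


def test_characterize_uppercase_coupon_applies_discount():
    # Current behavior: the exact string "SAVE10" takes 10% off.
    assert legacy_discount(100.0, "SAVE10") == 90.0


def test_characterize_lowercase_coupon_is_ignored():
    # Probably a bug, but pinned on purpose: "save10" gives no discount today.
    # If a later change makes this test fail, the behavior change is visible.
    assert legacy_discount(100.0, "save10") == 100.0
```

Once both tests pass against the untouched code, any refactor that alters either outcome shows up as a named failure instead of a silent behavior change.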
It generates the test generation PROMPT — the contract you paste into ChatGPT, Claude, or your coding assistant along with the code. The value is repeatability: the same contract produces the same coverage discipline on every module, instead of whatever the model feels like testing today.