Prompt A/B Testing, Before You Run Anything
A/B test prompts on paper first: score both variants on output control and clarity, fix the loser's gaps, then spend your runs on a fair fight.
Overview
Real prompt A/B testing — run both variants many times, rate the outputs — is expensive, and most of that expense is wasted when one variant was structurally weaker from the start. A paper round first saves the budget: score both prompts, surface the gaps, and either fix the weak variant or skip the runtime test entirely because the difference is already decisive. This resource loads two ad-copy variants so you can run that paper round and see what a fair A/B pair looks like.
Workflow
- Score both variants first
  Paste variants A and B. The loaded example pairs a controlled variant with a vague one; note the output-control gap.
- Fix the structural loser
  Apply the improvement suggestions to the weaker variant. An A/B test against a structurally broken variant proves nothing.
- Re-compare until it's close
  Once the paper scores are within a few points of each other, the test is fair, and the runtime outputs will tell you something real.
- Run the live test
  Run both prompts on the same inputs and rate the outputs. Keep the paper report as the record of what differed going in.
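The re-compare step can be pictured as a simple gate. The sketch below is illustrative only: the 0-100 scale, the unweighted average of output control and clarity, the 5-point threshold standing in for "a few points", and the names `PaperScore` and `next_step` are all assumptions, not part of any particular scoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaperScore:
    """Paper-round scores for one variant (hypothetical 0-100 scale)."""
    name: str
    output_control: int
    clarity: int

    @property
    def total(self) -> float:
        # Unweighted mean of the two dimensions; the weighting is an assumption.
        return (self.output_control + self.clarity) / 2


def next_step(a: PaperScore, b: PaperScore, close_within: float = 5.0) -> str:
    """Decide whether the pair is ready for a live test.

    close_within is an assumed stand-in for "within a few points".
    """
    gap = abs(a.total - b.total)
    if gap <= close_within:
        return "run live test"
    weaker = a if a.total < b.total else b
    return f"fix {weaker.name} and re-score"


# Made-up scores mirroring a controlled variant paired with a vague one.
controlled = PaperScore("A", output_control=85, clarity=80)
vague = PaperScore("B", output_control=35, clarity=60)
print(next_step(controlled, vague))  # prints "fix B and re-score"
```

With these made-up numbers the gap is 35 points, so the gate sends variant B back for fixes. Pick the threshold to match however coarse your own scoring is.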
Why This Works
- Many A/B 'winners' simply had more output control going in; the paper round catches that before you pay for runs
- Fixing the weak variant first turns the live test into a real experiment instead of a foregone conclusion
- Scores create an audit trail: you can show why a variant entered the test and why it won
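One way to keep that audit trail is a small record per variant, written out as a JSON line. This is a sketch under assumptions: `AuditRecord`, its field names, and the example values are hypothetical and only show the kind of information worth capturing.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AuditRecord:
    """One audit-trail entry; every field name here is hypothetical."""
    variant: str
    paper_score_before_fix: float
    paper_score_at_entry: float
    fixes_applied: list[str] = field(default_factory=list)
    live_result: str = "pending"  # filled in once the live test finishes


def to_json_line(record: AuditRecord) -> str:
    # sort_keys keeps the key order stable so records diff cleanly.
    return json.dumps(asdict(record), sort_keys=True)


# Made-up example: variant B after its structural gaps were fixed.
record = AuditRecord(
    variant="B",
    paper_score_before_fix=47.5,
    paper_score_at_entry=80.0,
    fixes_applied=["added output format", "added length limit"],
)
print(to_json_line(record))
```

Appending one such line per variant to a log file gives you something to point at when a prompt is later promoted into a production template.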
Best for
- Workflows where prompt runs cost real money or reviewer time
- Teams that promote prompts into production templates after testing
- Variants that differ in structure, not just a word or two
Not for
- Replacing output evaluation entirely: structure predicts quality but doesn't guarantee it
- Variants that differ only by one synonym — there is nothing structural to compare
Use cases
- Pre-screening prompt variants before a paid evaluation run
- Making sure an A/B test compares two real alternatives, not a strong prompt against a stub
- Documenting why a variant was promoted, with scores instead of vibes