Evaluate AI Prompt Quality with Scores
Put numbers on prompt quality: eight scored dimensions — clarity, specificity, structure, output control, completeness, risk, efficiency, readiness.
Overview
"Is this a good prompt?" is easier to answer when you can measure it against an alternative. Quality decomposes into checkable parts: is the wording concrete or vague, does it control the output's shape and length, does it cover audience and context, does it contradict itself, does every word earn its tokens? Score a prompt against a baseline variant and the abstract question becomes eight specific ones. The loaded pair compares a typical mid-quality prompt against a strong one so you can calibrate what each score band looks like.
Workflow
- Compare the calibration pair: Run the loaded example with Model Readiness focus. Note which dimensions separate the mid prompt from the strong one.
- Score your own prompt: Paste your prompt as A and the strong example (or your own improved draft) as B to see where yours lands.
- Read the gaps, not just the number: The Risks / Gaps list is the actionable part; each entry names a missing quality dimension in plain words.
- Iterate and re-compare: Apply two or three suggestions, re-compare, and watch which dimensions move. That's the feedback loop.
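To make the "read the gaps" step concrete, here is a minimal Python sketch of what comparing two score cards by dimension might look like. It is not the tool's implementation. The scores are invented for illustration, the 0-100 scale and the "higher is better" reading of every dimension (including risk) are assumptions, and `dimension_gaps` is a hypothetical helper name.

```python
# Hypothetical per-dimension scores (assumed 0-100, higher is better).
# These numbers are made up for illustration; they are not tool output.
DIMENSIONS = [
    "clarity", "specificity", "structure", "output_control",
    "completeness", "risk", "efficiency", "readiness",
]

def dimension_gaps(scores_a, scores_b):
    """Return (dimension, B minus A) pairs, largest gap first."""
    gaps = [(dim, scores_b[dim] - scores_a[dim]) for dim in DIMENSIONS]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)

# Prompt A: a mid-quality draft. Prompt B: the stronger reference.
prompt_a = {"clarity": 62, "specificity": 55, "structure": 58, "output_control": 40,
            "completeness": 50, "risk": 70, "efficiency": 75, "readiness": 57}
prompt_b = {"clarity": 85, "specificity": 82, "structure": 80, "output_control": 88,
            "completeness": 84, "risk": 78, "efficiency": 72, "readiness": 83}

for dim, gap in dimension_gaps(prompt_a, prompt_b):
    print(f"{dim:15s} {gap:+d}")
```

Sorting by gap rather than by raw score puts the dimensions where A trails B the most at the top, which is usually where the next edit should go. A negative gap (here, efficiency) flags a dimension where the reference prompt is actually weaker.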
Why This Works
- Decomposed scores turn "be a better prompt writer" into specific, fixable habits
- Comparing against a reference prompt anchors the scores — a number only means something next to another number
- The same eight dimensions apply to every prompt type, so the skill transfers across tasks
Best for
- Anyone who wants a working definition of prompt quality, not folklore
- Reviewing prompts before they enter a shared library
- Diagnosing why a prompt underperforms by seeing which dimension drags it down
Not for
- Grading a single prompt in isolation — the comparator needs a second prompt as the reference point
- Output evaluation — this scores the instructions, not the model's answer
Use cases
- Benchmarking your everyday prompt against a deliberately strengthened version
- Calibrating what an 80+ output-control score actually looks like in practice
- Building intuition for which dimension your prompts habitually neglect
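For the last use case, one way to spot a habitually neglected dimension is to keep the score cards from several comparisons and average each dimension. The sketch below assumes you have copied the scores out by hand. `weakest_dimension` is a hypothetical helper, the numbers are invented, and only three of the eight dimensions are shown to keep the example short.

```python
from statistics import mean

def weakest_dimension(score_cards):
    """Return (dimension, mean score) for the lowest-averaging dimension.

    Assumes every card has the same keys and that higher scores are better.
    """
    dims = score_cards[0].keys()
    averages = {d: mean(card[d] for card in score_cards) for d in dims}
    worst = min(averages, key=averages.get)
    return worst, averages[worst]

# Invented scores from three separate prompts by the same author.
history = [
    {"clarity": 70, "output_control": 45, "completeness": 60},
    {"clarity": 75, "output_control": 50, "completeness": 58},
    {"clarity": 68, "output_control": 40, "completeness": 65},
]

dimension, average = weakest_dimension(history)
print(f"Lowest on average: {dimension} ({average})")
```

A dimension that stays at the bottom across unrelated prompts points to a writing habit rather than a problem with any single prompt.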