Missing Data in AI Extraction — Null, Unknown, or Skip
The most consequential setting in any extraction prompt: what the model does when the field isn't in the text. Four behaviors, and when each is right.
Overview
Every extraction eventually meets a text that doesn't contain the field, and what happens next is the difference between a reliable pipeline and silent garbage. Four contracts exist: Leave Empty (visible gap, human-friendly), Return Null (stable keys, pipeline-friendly), Return Unknown (loud gap a reviewer can't miss), and Skip Field (lean output, consumers must check key existence). This resource loads a lead-form extraction, a source that is sparse by nature, with Missing Data set to Return Unknown and strict ambiguity handling. That is the configuration where missing data is most visible, so you can read the contract language and swap behaviors to compare.
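As a rough illustration of how the four contracts differ in the output itself, here is a minimal Python sketch. The behavior keys, the `render_record` helper, and the sample lead are all hypothetical names made up for this example; they are not the tool's own API, and the literal `"unknown"` string is assumed from the description above.

```python
import json

# Hypothetical keys mirroring the four contracts described above.
BEHAVIORS = ("leave_empty", "return_null", "return_unknown", "skip_field")


def render_record(found: dict, fields: list[str], behavior: str) -> dict:
    """Shape one output record, given only the values actually found in the text."""
    if behavior not in BEHAVIORS:
        raise ValueError(f"unsupported behavior: {behavior}")
    record = {}
    for name in fields:
        if name in found:
            record[name] = found[name]      # present in the text: copy as-is
        elif behavior == "leave_empty":
            record[name] = ""               # visible gap for a human reader
        elif behavior == "return_null":
            record[name] = None             # stable key, serialized as JSON null
        elif behavior == "return_unknown":
            record[name] = "unknown"        # loud gap a reviewer will notice
        # skip_field: the key is omitted entirely
    return record


# Illustrative lead form: the message never mentions a budget.
fields = ["name", "email", "budget"]
found = {"name": "Dana Reyes", "email": "dana@example.com"}

for behavior in BEHAVIORS:
    print(f"{behavior:15} {json.dumps(render_record(found, fields, behavior))}")
```

In none of the four branches is a budget value made up; the contracts differ only in how the gap is represented.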
Workflow
- Match the behavior to the consumer
  Code that parses the output wants Return Null; a human scanning a sheet wants Leave Empty or Return Unknown; Skip Field only works when consumers check whether the key exists.
- Swap behaviors and diff the prompt
  Load this setup, change Missing Data, and watch the MISSING DATA block rewrite itself. The contract is two lines, and they do all the work.
- Keep the never-invent rule
  Whatever behavior you choose, the constant is "never invent or guess a value for a missing field". It is the one line every behavior shares.
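The swap-and-diff step can be imitated outside the tool with a short Python sketch. The wording of each behavior line below is an assumption written for this example, not the text the tool actually generates; only the never-invent sentence is taken from the description above. `missing_data_block` is a hypothetical helper.

```python
import difflib

# The shared rule, quoted from the description above.
NEVER_INVENT = "Never invent or guess a value for a missing field."

# Assumed wording for the behavior-specific line; the real prompt may differ.
BEHAVIOR_LINES = {
    "leave_empty": "If a field is not present in the text, leave its value empty.",
    "return_null": "If a field is not present in the text, return null for it.",
    "return_unknown": 'If a field is not present in the text, return the string "unknown".',
    "skip_field": "If a field is not present in the text, omit that field from the output.",
}


def missing_data_block(behavior: str) -> list[str]:
    """Build a heading plus the two-line contract for one behavior."""
    return ["MISSING DATA", BEHAVIOR_LINES[behavior], NEVER_INVENT]


before = missing_data_block("return_unknown")
after = missing_data_block("return_null")

# Only the behavior line should change; the never-invent line stays put.
for line in difflib.unified_diff(before, after, "return_unknown", "return_null", lineterm=""):
    print(line)
```

Keeping the never-invent sentence as a separate constant means no behavior swap can drop it by accident.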
Why This Works
- Naming the behavior in the prompt removes the model's single biggest improvisation point
- "Unknown" exploits human attention: a literal string in a data column gets investigated
- The never-invent rule survives every behavior choice — absence stays absence
Best for
- Sparse sources — forms, short messages, partial documents
- Teams debugging "where did this value come from?" incidents
- Pipelines whose consumers disagree about null vs missing keys
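When consumers disagree about null versus missing keys, one option is a single read helper that treats all four contracts alike. The sketch below is illustrative only: `read_field` and the sentinel strings are assumptions for this example, and a source where "unknown" can be a legitimate value would need a different sentinel.

```python
from typing import Any, Optional

# Assumed sentinel strings for Leave Empty and Return Unknown.
MISSING_SENTINELS = ("", "unknown")


def read_field(record: dict[str, Any], name: str) -> Optional[Any]:
    """Return the field's value, or None when any of the four contracts marks it missing."""
    if name not in record:      # Skip Field: the key was omitted
        return None
    value = record[name]
    if value is None:           # Return Null
        return None
    if isinstance(value, str) and value.strip().lower() in MISSING_SENTINELS:
        return None             # Leave Empty or Return Unknown
    return value


# The same missing budget under each contract reads back identically.
samples = [{"budget": ""}, {"budget": None}, {"budget": "unknown"}, {}]
print([read_field(s, "budget") for s in samples])
print(read_field({"budget": "5000 USD"}, "budget"))
```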
Not for
- Ambiguity handling — that's the neighboring policy; missing means absent, ambiguous means unclear
- Format-level null semantics (JSON null vs empty XML element) — the engine adapts those automatically
Use cases
- Choosing the right missing-data behavior before a pipeline ships
- Making data gaps visible to human reviewers with "unknown"
- Stopping models from inventing values for absent fields