Missing Data in AI Extraction — Null, Unknown, or Skip

One of the most consequential settings in any extraction prompt is what the model does when the field isn't in the text. This page covers the four behaviors and when each is right.

Overview

Every extraction eventually meets a text that doesn't contain the field, and what happens next is the difference between a reliable pipeline and silent garbage. Four contracts exist: Leave Empty (visible gap, human-friendly), Return Null (stable keys, pipeline-friendly), Return Unknown (loud gap a reviewer can't miss), and Skip Field (lean output, consumers must check key existence). This resource loads a lead-form extraction, a source that is sparse by nature, with Missing Data set to Return Unknown and ambiguity handling set to strict. That is the configuration where missing data is most visible, so you can read the contract language and swap behaviors to compare.
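To make the four contracts concrete, here is a minimal Python sketch of the output each one might produce for the same lead-form record when the phone field is absent. The function name apply_missing_policy, the field names, and the exact placeholder values (an empty string for Leave Empty, the literal string "unknown" for Return Unknown) are assumptions chosen for illustration, not the tool's actual output.

```python
import json

# Hypothetical behavior labels, mirroring the four contracts named above.
BEHAVIORS = ("leave_empty", "return_null", "return_unknown", "skip_field")


def apply_missing_policy(found, schema, behavior):
    """Shape a record so every schema field absent from `found` follows one contract."""
    if behavior not in BEHAVIORS:
        raise ValueError(f"unsupported behavior: {behavior}")
    record = {}
    for field in schema:
        if field in found:
            record[field] = found[field]
        elif behavior == "leave_empty":
            record[field] = ""          # assumed: a visible blank cell
        elif behavior == "return_null":
            record[field] = None        # serializes to JSON null; key stays stable
        elif behavior == "return_unknown":
            record[field] = "unknown"   # assumed literal; a loud gap for reviewers
        # skip_field: the key is omitted entirely
    return record


schema = ["name", "email", "phone"]
found = {"name": "Ada Park", "email": "ada@example.com"}  # phone is not in the text

for behavior in BEHAVIORS:
    shaped = apply_missing_policy(found, schema, behavior)
    print(f"{behavior:15} {json.dumps(shaped)}")
```

Only the Skip Field record lacks the phone key altogether, which is why its consumers have to check key existence rather than compare values.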

Workflow

  1. Match the behavior to the consumer

    Code parsing the output wants Return Null; a human scanning a sheet wants Leave Empty or Return Unknown; Skip Field needs existence-checking consumers.

  2. Swap behaviors and diff the prompt

    Load this setup, change Missing Data, and watch the MISSING DATA block rewrite itself — the contract is two lines, and they do all the work.

  3. Keep the never-invent rule

    Whatever behavior you choose, the constant is "never invent or guess a value for a missing field" — the line every behavior shares.
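The swap-and-diff step can be approximated offline. The sketch below builds a MISSING DATA block as a header plus two lines and diffs two behaviors with Python's difflib. The behavior-specific wording is an assumption (the real prompt text may differ); only the never-invent line is adapted from the rule quoted above. The names missing_data_block and diff_behaviors are hypothetical.

```python
import difflib

# The one line every behavior shares, adapted from the rule quoted above.
NEVER_INVENT = "Never invent or guess a value for a missing field."

# Assumed wording for the behavior-specific line; the real tool's text may differ.
BEHAVIOR_LINES = {
    "Leave Empty": "If a field is not present in the text, leave its value empty.",
    "Return Null": "If a field is not present in the text, return null for it.",
    "Return Unknown": 'If a field is not present in the text, return "unknown" for it.',
    "Skip Field": "If a field is not present in the text, omit that field from the output.",
}


def missing_data_block(behavior):
    """Build the MISSING DATA block: a header plus the two contract lines."""
    return ["MISSING DATA", BEHAVIOR_LINES[behavior], NEVER_INVENT]


def diff_behaviors(before, after):
    """Return a unified diff of the block when the behavior is swapped."""
    return list(difflib.unified_diff(
        missing_data_block(before),
        missing_data_block(after),
        fromfile=before,
        tofile=after,
        lineterm="",
    ))


print("\n".join(diff_behaviors("Return Unknown", "Return Null")))
```

In this sketch, swapping the behavior changes exactly one line of the block, while the never-invent line appears as unchanged context in every diff.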

Why This Works

  • Naming the behavior in the prompt removes the model's single biggest improvisation point
  • "Unknown" exploits human attention: a literal string in a data column gets investigated
  • The never-invent rule survives every behavior choice — absence stays absence

Best for

  • Sparse sources — forms, short messages, partial documents
  • Teams debugging "where did this value come from?" incidents
  • Pipelines whose consumers disagree about null vs missing keys
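When consumers disagree about null versus missing keys, one possible stopgap is a small reader that treats all four behaviors as the same absence. This is a sketch under assumptions: read_field is a hypothetical helper, and the placeholder values it recognizes (an empty string and the literal "unknown") are assumed rather than taken from the tool.

```python
# Sentinel that distinguishes "key missing" from "key present with value None".
_ABSENT = object()

# Assumed placeholder strings that Leave Empty and Return Unknown might emit.
PLACEHOLDERS = ("", "unknown")


def read_field(record, field):
    """Return the field's value, or None if it is absent under any of the four behaviors."""
    value = record.get(field, _ABSENT)
    if value is _ABSENT or value is None:
        return None  # Skip Field (no key) or Return Null
    if isinstance(value, str) and value.strip().lower() in PLACEHOLDERS:
        return None  # Leave Empty or Return Unknown
    return value


print(read_field({"phone": "unknown"}, "phone"))   # None
print(read_field({}, "phone"))                     # None
print(read_field({"phone": "555-0100"}, "phone"))  # 555-0100
```

The trade-off is that a field whose genuine value is the word "unknown" would also be read as absent, so a normalizer like this only suits fields where that string cannot be real data.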

Not for

  • Ambiguity handling — that's the neighboring policy; missing means absent, ambiguous means unclear
  • Format-level null semantics (JSON null vs empty XML element) — the engine adapts those automatically

Use cases

  • Choosing the right missing-data behavior before a pipeline ships
  • Making data gaps visible to human reviewers with "unknown"
  • Stopping models from inventing values for absent fields
