Context Window Planning for RAG — Budget the Retrieved Docs
RAG context is a budget with line items: retrieved documents, the question, and the answer all share one window. Plan how many chunks actually fit.
Overview
Retrieval pipelines fail quietly when the retrieved context outgrows its share of the budget: documents get truncated, the model answers from half the evidence, and nobody changed any code. This scenario budgets a retrieved-document set against a large response reservation, which is the realistic RAG shape. The breakdown answers the design question: with this chunk size, how many documents fit alongside the question and the reserved answer? On a million-token window the same retrieval set barely registers, which is itself a design input: a chunk count that strains one model costs almost nothing on another.
Workflow
- Paste a representative retrieval set: real chunks at real sizes. The estimate scales to your top-k from there.
- Reserve the answer honestly: RAG answers cite and synthesize, so they are rarely small. Budget Large.
- Derive the chunk budget: headroom divided by chunk size gives the top-k the window actually supports.
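The arithmetic in the last step can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own implementation: the function name `supported_top_k` and every number in the example (window size, question size, the token count standing in for a "Large" reservation, chunk size) are assumptions chosen for readability.

```python
def supported_top_k(window_tokens, question_tokens, reserved_answer_tokens, chunk_tokens):
    """Return how many whole chunks fit in the headroom left after
    the question and the reserved answer are subtracted from the window."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    headroom = window_tokens - question_tokens - reserved_answer_tokens
    # Floor division: a partially fitting chunk would be truncated, so it does not count.
    return max(0, headroom // chunk_tokens)


# Illustrative numbers only (assumptions, not measurements).
top_k = supported_top_k(
    window_tokens=32_000,
    question_tokens=200,
    reserved_answer_tokens=4_000,  # stand-in for a "Large" answer reservation
    chunk_tokens=800,
)
print(top_k)  # 34
```

With these assumed figures the headroom is 27,800 tokens, so 34 chunks of 800 tokens fit and a 35th would be cut off.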
Why This Works
- Line-item budgeting matches how RAG context is actually composed
- Quantified headroom converts directly into a top-k decision
- Cross-model comparison reframes chunk limits as a model choice
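As a rough sketch of the cross-model comparison, the same line items can be evaluated against several window sizes. The model names and window sizes below are hypothetical placeholders, and `compare_models` is an illustrative helper, not part of any library; substitute the documented limits of the models you are actually considering.

```python
# Hypothetical window sizes for illustration only.
CANDIDATE_WINDOWS = {"model-small": 32_000, "model-large": 1_000_000}


def compare_models(windows, question_tokens, reserved_answer_tokens, chunk_tokens, top_k):
    """For each candidate window, report the share taken by retrieved docs,
    whether the full request fits, and the remaining headroom in tokens."""
    retrieved = chunk_tokens * top_k
    used = question_tokens + reserved_answer_tokens + retrieved
    report = {}
    for name, window in windows.items():
        report[name] = {
            "retrieved_share": retrieved / window,
            "fits": used <= window,
            "headroom": window - used,
        }
    return report


report = compare_models(
    CANDIDATE_WINDOWS,
    question_tokens=200,
    reserved_answer_tokens=4_000,
    chunk_tokens=800,
    top_k=30,
)
for name, row in report.items():
    print(f"{name}: {row['retrieved_share']:.1%} of window, headroom {row['headroom']}")
```

Under these assumptions the 30-chunk retrieval set takes 75% of the smaller window and about 2.4% of the larger one, which is the sense in which a chunk limit becomes a model choice.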
Best for
- RAG and retrieval pipeline builders
- Prompt stuffers deciding how many docs to include
- Capacity planning across candidate models
Not for
- Designing the retrieval ranking itself — this budgets what retrieval returns
- Formatting the retrieved documents with delimiters — that's the Long Input Formatter
Use cases
- Sizing top-k retrieval against the real window
- Diagnosing silently truncated retrieved context
- Choosing chunk sizes with budget arithmetic
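For the truncation-diagnosis use case, one way to see which retrieved chunks would be lost is to pack them in rank order against the retrieval budget. The sketch below assumes chunks are concatenated in rank order and everything past the first chunk that does not fit is cut; your pipeline may truncate differently. The helper name `split_at_budget` and the token counts are hypothetical, and the counts are assumed to come from whatever tokenizer your model uses.

```python
def split_at_budget(chunk_token_counts, retrieval_budget_tokens):
    """Pack chunks in rank order until one no longer fits.

    Returns (kept_indices, dropped_indices, tokens_used). Once a chunk
    overflows the budget, it and every later chunk count as dropped,
    mirroring naive concatenate-then-truncate behavior (an assumption).
    """
    kept, dropped = [], []
    used = 0
    for index, tokens in enumerate(chunk_token_counts):
        if not dropped and used + tokens <= retrieval_budget_tokens:
            kept.append(index)
            used += tokens
        else:
            dropped.append(index)
    return kept, dropped, used


# Illustrative token counts per retrieved chunk, in rank order.
kept, dropped, used = split_at_budget([900, 750, 1200, 400], retrieval_budget_tokens=2_000)
print(kept, dropped, used)  # [0, 1] [2, 3] 1650
```

A non-empty `dropped` list is the signal that the model would be answering from partial evidence, even though no code changed.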