Why your RAG retrieves garbage — fix your chunking

// how retrieval actually works

Retrieval-augmented generation (RAG) breaks your docs into chunks, turns each chunk into a vector (an embedding), and at query time pulls the handful of chunks most similar to the question — then hands only those to the model.
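To make that loop concrete, here is a minimal sketch of embed, compare, and take the top k. The `embed` function is a toy stand-in (a bag-of-words count vector), not a real embedding model, and the sample chunks and all function names are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical sample chunks
chunks = [
    "Refunds are processed within five business days.",
    "The API key goes in the Authorization header.",
    "Shipping takes three to seven days.",
]
top = retrieve("Where does the API key go?", chunks, k=1)
```

Only what ends up in `top` reaches the model. A real pipeline would swap `embed` for an embedding model and the sort for a vector index, but the shape stays the same.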

So the model can only be as good as the chunks you retrieved. If the right text never makes it into a retrievable chunk, no amount of prompting saves you. The chunk boundary is the whole ballgame.

Note: pure vector similarity is the baseline. Strong production pipelines also add keyword search (BM25) and a reranking step — see the fix below.

Three ways chunking goes wrong

1 · Chunks too big → a blurry average

Many embedding models pool token vectors, often by averaging them, into one fixed-length vector. Cram five topics into one chunk and its embedding becomes a muddy average that matches no specific question sharply.

    # one giant chunk = many topics, one vector
    chunk = "Billing... Refunds... Shipping... Login... Privacy..."
    # embedding ≈ the average of all of them → matches nothing well

As Weaviate puts it: oversized chunks "mix multiple ideas together… a noisy, 'averaged' embedding that doesn't clearly represent any single topic."
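A worked toy example of the blur, under a loud simplifying assumption: each topic gets its own orthogonal direction and a chunk's embedding is the plain mean of its topic vectors. Real embedding spaces are not this tidy, so treat the numbers as an illustration of the effect, not a measurement.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Average a list of vectors component by component."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


topics = ["billing", "refunds", "shipping", "login", "privacy"]
# Assumption: one orthogonal unit vector per topic
topic_vecs = {
    name: [1.0 if i == j else 0.0 for j in range(len(topics))]
    for i, name in enumerate(topics)
}

query = topic_vecs["refunds"]                     # a question purely about refunds
focused = topic_vecs["refunds"]                   # a chunk about refunds only
blurry = mean_pool(list(topic_vecs.values()))     # a chunk covering all five topics

focused_score = cosine(query, focused)  # 1.0
blurry_score = cosine(query, blurry)    # 1 / sqrt(5), about 0.447
```

With five equally weighted topics, the giant chunk scores under half of what the focused chunk scores, and it scores the same mediocre value for every one of the five topics.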

2 · Chunks too small → dangling references

Split mid-thought and a chunk can contain a pronoun whose antecedent lives in another chunk. Embedded alone, it's meaningless — and often won't be retrieved at all. This has a name: the anaphoric reference problem.

    # chunk A
    "Berlin is the capital of Germany."
    # chunk B: embedded alone, "Its" points to nothing
    "Its population is 3.7 million."
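One rough way to spot this in your own index is to flag chunks that open with a pronoun. The sketch below is a heuristic of my own, not something from the sources: the pronoun list is hand-picked and it will raise false positives (plenty of healthy chunks start with "This"), so use it to find chunks worth eyeballing rather than as a filter.

```python
import re

# Assumption: a small, hand-picked pronoun list; extend it for your corpus.
LEADING_PRONOUNS = {
    "it", "its", "they", "their", "them",
    "this", "that", "these", "those", "he", "she",
}


def has_dangling_start(chunk: str) -> bool:
    """True if the chunk's first word is a pronoun that may refer to an earlier chunk."""
    words = re.findall(r"[A-Za-z']+", chunk)
    return bool(words) and words[0].lower() in LEADING_PRONOUNS


chunks = [
    "Berlin is the capital of Germany.",
    "Its population is 3.7 million.",
]
flags = [has_dangling_start(c) for c in chunks]
```

Here the second chunk is flagged, which is the cue to merge it with its neighbour or attach the missing context.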

3 · No overlap → the answer on a boundary

Fixed-size cuts fall in arbitrary places. If the answer straddles the boundary, neither neighbouring chunk holds the whole thing.

    # blind fixed-size slice: splits sentences anywhere
    chunk = text[0:1000]
    # "...the API key goes in the" | "Authorization header."
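The boundary problem is easy to reproduce. In this sketch the helper `fixed_chunks`, the sample sentence, and the sizes (50 characters, 10 of overlap) are all illustrative choices, picked small so the effect shows up in a single line of text.

```python
def fixed_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Slice text into fixed-size windows; each window starts size - overlap after the last."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


text = "Create a token first. The API key goes in the Authorization header."
answer = "the Authorization header"

no_overlap = fixed_chunks(text, size=50)
with_overlap = fixed_chunks(text, size=50, overlap=10)

# Does any single chunk hold the whole answer?
survives_without = any(answer in c for c in no_overlap)
survives_with = any(answer in c for c in with_overlap)
```

Without overlap the cut lands inside "Authorization" and no chunk holds the full phrase. With 10 characters of overlap the second window starts early enough to contain it. Overlap only rescues answers shorter than the overlap window, which is one reason splitting on structure is the better first move.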

The fix

do split on structure, not a character count

Cut on the document's own seams — headings, paragraphs, sentence and function boundaries — so a chunk is a coherent unit, not a random 1000-character window.

    # recommended default: recursive, structure-aware splitter
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", ". ", " "],  # paragraphs → lines → sentences → words
        chunk_size=700,
        chunk_overlap=100,  # ~10–15% overlap so boundary answers survive
    )
    chunks = splitter.split_text(doc)  # doc: the full document text as a str

Overlap is a sensible default, not a guaranteed win — and the optimal chunk size depends on your data and embedding model. Test it against your own retrieval metrics.
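"Test it" can be as small as a handful of question and expected-snippet pairs plus a hit rate. The sketch below assumes you can wrap your own pipeline in a `retrieve(question, k)` callable. The keyword-overlap retriever and the evaluation pairs are placeholders so the example runs on its own.

```python
from typing import Callable


def hit_rate_at_k(
    eval_set: list[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected snippet appears in the top-k chunks."""
    hits = 0
    for question, expected_snippet in eval_set:
        top = retrieve(question, k)
        if any(expected_snippet in chunk for chunk in top):
            hits += 1
    return hits / len(eval_set)


# Placeholder corpus and retriever; swap in your real chunks and pipeline.
chunks = [
    "Refunds are processed within five business days.",
    "The API key goes in the Authorization header.",
    "Shipping takes three to seven days.",
]


def keyword_retrieve(question: str, k: int) -> list[str]:
    """Rank chunks by how many lowercase words they share with the question."""
    q = set(question.lower().split())
    return sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)[:k]


eval_set = [
    ("how long do refunds take", "five business days"),
    ("where does the api key go", "Authorization header"),
]
score = hit_rate_at_k(eval_set, keyword_retrieve, k=1)
```

Re-chunk with a different size or overlap, rebuild the index, and compare `score` across runs on the same `eval_set`.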

do keep one idea per chunk, with its context attached

The highest-leverage move is to prepend, before embedding, the context each chunk needs to stand alone: its section title, source document, and a one-line summary. Anthropic calls this Contextual Retrieval, and in its version an LLM writes that short context for every chunk.

    # attach context so the chunk is self-contained
    chunk = {
        "section": "Refunds",
        "context": "From the 2024 returns policy. Refunds are issued to the original card.",
        "text": "Refunds are processed within five business days.",
    }
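The context only helps if it is part of the string you embed. A minimal sketch of that step, where the function name `to_embedding_text` and the bracketed layout are my own choices rather than a format from Anthropic, and the context here is hand-written instead of LLM-generated:

```python
def to_embedding_text(chunk: dict) -> str:
    """Join section, context, and body into the single string that gets embedded."""
    return f"[{chunk['section']}] {chunk['context']}\n{chunk['text']}"


chunk = {
    "section": "Refunds",
    "context": "From the 2024 returns policy. Refunds are issued to the original card.",
    "text": "Refunds are processed within five business days.",
}
embedding_input = to_embedding_text(chunk)
```

Embed `embedding_input`, and index the same string for keyword search if you run one, but keep `chunk["text"]` around so you can show the model or the user the original wording.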

In Anthropic's published tests, Contextual Retrieval cut the top-20 retrieval failure rate by 35% with contextual embeddings alone, by 49% with contextual BM25 added, and by up to 67% with a reranking step on top.
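Combining vector and keyword results means merging two ranked lists. One common way to do that is reciprocal rank fusion (RRF). Picking RRF here is my assumption, not a detail taken from the sources, and the chunk ids and the constant `k=60` are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each list adds 1 / (k + rank) to a chunk's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


# Hypothetical chunk ids, best match first in each list
vector_ranking = ["c2", "c1", "c3"]   # from embedding similarity
keyword_ranking = ["c1", "c4", "c2"]  # from BM25

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
```

Chunks that both retrievers like (`c1`, `c2`) rise above chunks only one of them found. A reranker would then rescore the top of `fused` before anything reaches the model.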

The one-line takeaway

Before you blame the model, look at your chunks. Split on structure, add a little overlap, and keep one idea per chunk with its context attached. Fix the chunks first — most "bad RAG" is a chunking bug.

Sources: Anthropic — Contextual Retrieval · Pinecone — Chunking strategies · Weaviate — Chunking strategies · LangChain — RecursiveCharacterTextSplitter
