
RAG that actually retrieves the right thing

$ Jakub Jirák · Jun 12 · 10 min

Retrieval-augmented generation has a reputation problem: people blame the model when the real failure is upstream. If the retriever hands the model the wrong three chunks, no amount of prompt-tuning saves the answer. RAG is a search problem wearing an LLM costume, and search is the part teams under-invest in.

The pipeline, and where it breaks

# the RAG pipeline
ingest   -> chunk documents, embed, index
retrieve -> embed the query, find top-k chunks
rerank   -> reorder by a sharper relevance signal   (often skipped — a mistake)
generate -> stuff chunks into the prompt, answer with citations

Almost every quality problem lives in the chunking step of ingest and in retrieve. The model is the reliable part.

Chunking is a decision, not a default

The naive move — split every document into 500-token windows — is where most RAG quality dies. A chunk should be a self-contained unit of meaning, because that's what gets retrieved and read in isolation.

Respect structure. Split on headings, function boundaries, table rows — not arbitrary character counts that guillotine a sentence or orphan a code block from its signature.
Keep chunks small but whole. Too big and one chunk dilutes the embedding with three topics; too small and it loses the context that makes it answerable.
Carry metadata. Source, section title, date, author. Half of "retrieve the right thing" is filtering by metadata before you ever compute a similarity.
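Here is a minimal sketch of those three rules, assuming markdown-style `#` headings as the structural boundary. The names `Chunk`, `chunk_by_heading`, `filter_by_metadata`, and the `max_chars` budget are all illustrative, not from any library, and the sketch does not try to protect fenced code blocks or tables.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


# Assumption: documents use markdown-style "#" headings as section boundaries.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def _pack_paragraphs(body: str, max_chars: int) -> list[str]:
    """Greedily pack whole paragraphs into pieces of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    pieces, current = [], ""
    for para in re.split(r"\n\s*\n", body):
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            pieces.append(current)
            current = para
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


def chunk_by_heading(doc: str, source: str, max_chars: int = 1500) -> list[Chunk]:
    """Split on headings first, then on paragraph breaks if a section is too big."""
    sections: list[tuple[str, list[str]]] = [("", [])]
    for line in doc.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append((match.group(2).strip(), []))
        else:
            sections[-1][1].append(line)

    chunks = []
    for title, lines in sections:
        body = "\n".join(lines).strip()
        if not body:
            continue
        for piece in _pack_paragraphs(body, max_chars):
            # Every chunk carries where it came from.
            chunks.append(Chunk(piece, {"source": source, "section": title}))
    return chunks


def filter_by_metadata(chunks: list[Chunk], **required) -> list[Chunk]:
    """Keep only chunks whose metadata matches every required key/value."""
    return [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in required.items())
    ]
```

Running `filter_by_metadata(chunks, section="Setup")` before any similarity search is the cheap version of "filter first, then rank". The paragraph packer never cuts inside a paragraph, which is the point: an oversized chunk is usually a smaller problem than a guillotined one.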

If you change one thing in a struggling RAG system, change the chunking. It's the highest-leverage and most-ignored knob.

Embeddings feel objective. They aren't

Cosine similarity returns a confident number for everything, which is exactly the trap: it will happily rank a plausibly worded but wrong chunk above the dry, correct one (the embeddings post on this site riffs on this). Two cheap, high-impact fixes:

Hybrid search. Combine dense vectors with old-fashioned keyword/BM25 search. Keywords catch the exact identifiers, error codes, and product names that embeddings smear together. Fusing the two ranked lists usually beats either one alone.
Rerank. Pull a generous top-20 with the fast vector search, then reorder with a cross-encoder reranker that actually reads query-and-chunk together. This single step is the biggest quality jump per line of code in most RAG systems, and it's the step most people skip.
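Both fixes fit in a few lines. The sketch below fuses a dense ranking and a keyword ranking with reciprocal rank fusion (RRF), which is one common way to combine ranked lists and is my choice here, not something the stack above requires. The reranker takes a `score_pair` callable standing in for a cross-encoder; that callable, the function names, and the constant `k=60` are assumptions for illustration.

```python
from collections import defaultdict
from typing import Callable


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one ranking.

    Each list contributes 1 / (k + rank) per chunk, so a chunk that ranks
    well in both dense and keyword search rises to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def rerank(
    query: str,
    candidates: dict[str, str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Reorder candidate chunks by a scorer that sees query and chunk together.

    score_pair is a placeholder: in a real system it would wrap a
    cross-encoder model; here it is any function (query, text) -> float.
    """
    scored = [(score_pair(query, text), cid) for cid, text in candidates.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [cid for _, cid in scored[:top_n]]
```

In use, you would pull a generous candidate list from each retriever, fuse them, look up the text of the fused top 20, and pass that to `rerank`. The fast retrievers only have to get the right chunk somewhere in the candidate set; the slow scorer decides the final order.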

You can't fix what you don't measure separately

The discipline that separates working RAG from vibes-RAG: evaluate the retriever on its own, before the generator ever runs.

Build a small set of real questions with the known-correct source chunks.
Measure retrieval directly: is the right chunk in the top-k (recall@k), and how high does it rank (MRR)? If the answer is no, the generator was never going to succeed, so stop tuning prompts.
Only once retrieval is solid, evaluate end-to-end answer quality.
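The retrieval-only eval is small enough to write by hand. A sketch, assuming the eval set is a list of `(question, relevant_chunk_ids)` pairs and `retrieve` is whatever function returns ranked chunk ids in your system; both shapes are assumptions, as is the default `k=5`.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the known-correct chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first correct chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_retriever(
    eval_set: list[tuple[str, set[str]]],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> dict[str, float]:
    """Average recall@k and MRR over the eval set. No generator involved."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    recalls, ranks = [], []
    for question, relevant in eval_set:
        retrieved = retrieve(question)
        recalls.append(recall_at_k(retrieved, relevant, k))
        ranks.append(reciprocal_rank(retrieved, relevant))
    return {
        "recall@k": sum(recalls) / len(recalls),
        "mrr": sum(ranks) / len(ranks),
    }
```

Run it before and after every chunking or search change. If recall@k drops, you know the regression is in retrieval without reading a single generated answer.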

This split tells you which half is broken, which is the difference between fixing RAG in an afternoon and flailing for a sprint.

Do you even need RAG?

The laziest, most important question. RAG exists to inject knowledge that changes or is too big for the context window. If your corpus is small enough to fit in a long-context model's window, skip the vector database entirely and just put the documents in the prompt: no chunking, no embedding drift, no reranker, no index to keep fresh. A million-token window deletes a whole class of RAG infrastructure for small-to-medium corpora.
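A back-of-the-envelope check is enough to answer the question for most corpora. The sketch below assumes roughly four characters per token, which is only a rough heuristic for English prose; use your model's real tokenizer for anything close to the limit. The function names and the `reserved` budget for instructions and the answer are illustrative.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. The 4 chars/token ratio is an assumption."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(
    documents: list[str],
    context_window: int,
    reserved: int = 8000,
    chars_per_token: float = 4.0,
) -> bool:
    """True if the whole corpus fits, leaving `reserved` tokens for the
    instructions, the question, and the model's answer."""
    budget = context_window - reserved
    total = sum(estimate_tokens(doc, chars_per_token) for doc in documents)
    return total <= budget
```

If this returns `True` with comfortable margin and the corpus rarely changes, the simplest pipeline is no pipeline.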

Reach for real RAG when the corpus is genuinely large, changes often, or you need source citations and freshness control. Otherwise it's infrastructure you built six months early.

The lazy, correct RAG stack

A plain document store with hybrid (vector + keyword) search. No exotic vector DB until search quality demands it.
Structure-aware chunking with metadata. This is where your effort goes.
A reranker on the top-k. Non-negotiable; cheapest quality win available.
A 20-question retrieval eval you run on every change.
Citations back to source, so wrong answers are auditable.
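The citation piece is mostly prompt assembly. A minimal sketch, assuming each retrieved chunk is a dict with `source` and `text` keys; the function name and the instruction wording are illustrative and would need tuning for a real model.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Number each chunk so the answer can cite it as [1], [2], ..."""
    sources = []
    for number, chunk in enumerate(chunks, start=1):
        sources.append(f"[{number}] ({chunk['source']})\n{chunk['text']}")
    context = "\n\n".join(sources)
    return (
        "Answer the question using only the numbered sources below. "
        "Cite the source number after each claim, for example [1]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\n"
    )
```

Because the numbers map back to chunk metadata, a wrong answer can be traced to the chunk that caused it, which tells you whether to fix retrieval or generation.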

That's a real RAG system. It fits on one page because the hard part was never the model — it was deciding what the model gets to read. The same principle runs through every agent architecture post here.

#rag #retrieval #architecture