Retrieval-augmented generation gets demoed in an afternoon and debugged for a quarter. Wiring an embedding model to a vector store and stuffing the top results into a prompt is the easy 80%. The hard part is making it answer correctly on the long tail of real questions, and then keeping it correct as the corpus, the models, and the prompts all change underneath you.
Chunking is a retrieval decision, not a preprocessing step
How you split documents determines what you can ever retrieve. Chunk too large and a single passage dilutes the signal with unrelated text, so the embedding sits between topics and matches nothing well. Chunk too small and you sever the context that made the passage meaningful — a number without its heading, a clause without its subject. We chunk on semantic boundaries rather than fixed token counts, keep a sentence or two of overlap so ideas are not guillotined mid-thought, and attach metadata — source, section, date — to every chunk so retrieval can filter as well as rank. The chunk is the unit of truth; if the answer is not cleanly inside one, no model will reliably assemble it.
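A minimal sketch of that chunking policy, under stated assumptions: sentence boundaries stand in for "semantic boundaries", the caller passes one section at a time, and the names `Chunk`, `chunk_section`, `max_chars`, and `overlap` are hypothetical, not taken from any particular library. The regex splitter is deliberately naive; a real pipeline would likely use a proper sentence segmenter.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_section(text: str, *, source: str, section: str, date: str,
                  max_chars: int = 400, overlap: int = 1) -> list[Chunk]:
    """Pack whole sentences into chunks, carrying `overlap` sentences forward.

    `max_chars` is a soft limit: a chunk is closed as soon as adding the next
    sentence would exceed it, but a single long sentence is never cut in half.
    """
    groups: list[list[str]] = []
    current: list[str] = []
    for sentence in split_sentences(text):
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            groups.append(current)
            # Start the next chunk with the tail of the previous one.
            current = current[-overlap:] if overlap > 0 else []
        current.append(sentence)
    if current:
        groups.append(current)

    # Every chunk carries the same filterable metadata (assumed fields).
    meta = {"source": source, "section": section, "date": date}
    return [Chunk(" ".join(group), dict(meta)) for group in groups]
```

Splitting only at sentence ends means no chunk starts mid-thought, and the carried-over sentence gives each chunk a little of the context that preceded it.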
Rerank what you retrieve
Vector similarity is a coarse first pass. It is fast and recall-oriented, good at pulling fifty plausible candidates and bad at knowing which five actually answer the question. So we retrieve wide and then rerank narrow: a cross-encoder scores each candidate against the query directly, reading both together instead of comparing pre-computed vectors. That second stage is where precision comes from. It is the difference between fifty documents that mention the topic and five that contain the answer, and it typically adds only a few tens of milliseconds of latency.
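The two stages can be sketched as follows. This is an illustration under assumptions, not a production retriever: a bag-of-words cosine stands in for the embedding search, and `score_pair` is a hypothetical callable where a real cross-encoder would be plugged in (for example, a thin wrapper around the `predict` method of the `CrossEncoder` class in the sentence-transformers library). The function names `retrieve_wide` and `rerank_narrow` are made up for this example.

```python
import math
from collections import Counter
from typing import Callable


def bow_vector(text: str) -> Counter:
    """Stand-in for an embedding: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve_wide(query: str, corpus: list[str], k: int = 50) -> list[str]:
    """Stage 1: cheap, recall-oriented similarity over independent vectors."""
    q = bow_vector(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, bow_vector(doc)),
                    reverse=True)
    return ranked[:k]


def rerank_narrow(query: str, candidates: list[str],
                  score_pair: Callable[[str, str], float],
                  n: int = 5) -> list[str]:
    """Stage 2: score each (query, candidate) pair jointly and keep the best n."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:n]]
```

Keeping the pair scorer as a parameter means the expensive model only ever sees the `k` candidates from the first stage, which is what keeps the second stage affordable.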
"If you cannot measure retrieval separately from generation, you cannot tell whether the model lied or the search simply failed to find the truth." (Protocore · AI engineering)
Gate it with evals
Retrieval quality is measurable on its own, and we insist on measuring it before the generator ever sees a token. For a graded set of questions with known relevant chunks, we track recall — did the right chunk make it into the context — and ranking quality — how near the top it landed. These metrics become a gate in CI: a change to the chunker, the embedding model, or the index has to hold or improve them before it ships. Generation evals sit on top, but retrieval is measured first, because a generator fed the wrong context will produce a fluent, confident, wrong answer and take the blame for a failure that happened one stage upstream.
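As a rough sketch of such a gate, assuming recall@k for "did the right chunk make it into the context" and mean reciprocal rank for "how near the top it landed": the function names, the shape of the graded set (question paired with a set of relevant chunk IDs), and the `tolerance` parameter are all assumptions made for illustration.

```python
from typing import Callable, Iterable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the known-relevant chunks that appear in the top k."""
    if not relevant:
        raise ValueError("each graded question needs at least one relevant chunk")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(retrieve: Callable[[str], list[str]],
             graded_set: Iterable[tuple[str, set[str]]],
             k: int = 5) -> dict[str, float]:
    """Run the retriever over the graded questions and average both metrics."""
    recalls, ranks = [], []
    for question, relevant in graded_set:
        retrieved = retrieve(question)
        recalls.append(recall_at_k(retrieved, relevant, k))
        ranks.append(reciprocal_rank(retrieved, relevant))
    if not recalls:
        raise ValueError("graded set is empty")
    return {"recall_at_k": sum(recalls) / len(recalls),
            "mrr": sum(ranks) / len(ranks)}


def regressions(current: dict[str, float], baseline: dict[str, float],
                tolerance: float = 0.0) -> list[str]:
    """Names of metrics that dropped below the baseline; empty means pass."""
    return [name for name, base in baseline.items()
            if current[name] < base - tolerance]
```

In CI, a non-empty result from `regressions` would fail the build, so a chunker or index change cannot ship while quietly lowering either number.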
None of this is glamorous, and that is the point. Strong retrieval is mostly boundaries, metadata, a reranker, and a test harness that refuses to let any of them quietly regress. Get those four right and the language model has an easy job: summarize text that already contains the answer. Get them wrong and no amount of prompt engineering will save you, because you are asking the model to recall facts it was never actually shown.