The first version of any RAG system works on a curated demo set. The user asks a question, the system returns relevant chunks, the model answers, everyone is impressed. Three weeks later it’s deployed to a real corpus, real users ask real questions, and the answers start drifting in ways that nobody saw in development.
We’ve shipped RAG systems against legal documents, technical manuals, internal knowledge bases, and clinical guidelines. Production RAG is fundamentally a retrieval engineering problem with a thin layer of LLM on top — and most teams over-invest in the LLM and under-invest in retrieval.
The chunking decision is doing 40% of the work
The chunking strategy is more important than the embedding model, the retrieval algorithm, and the LLM combined. A good embedding model on poorly chunked data performs worse than a mediocre embedding model on well-chunked data. We’ve measured this enough times to be confident.
Bad chunking looks like fixed-size windows over raw text — every 512 tokens, regardless of structural boundaries. The retrieval system pulls a chunk that starts mid-sentence, mentions the topic in passing, and gives the LLM half a context window of noise. The model dutifully assembles an answer that sounds plausible and is wrong in a specific way that’s hard to debug.
Good chunking respects the structure of the source. For markdown, chunk on heading boundaries. For PDFs, chunk on sections, and preserve the section title as a header in each chunk. For tables, keep the table together; don’t slice it. For code, chunk on function boundaries. The chunk should be the smallest unit that’s self-contained — a reader looking at the chunk in isolation should understand what it’s about.
A practical pattern we use: each chunk has a small header prepended to it that names the document, the section, and the immediate parent context. When the chunk is retrieved, the model sees not just the text but its location in the source. Costs nothing in latency, dramatically improves answer quality.
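A minimal sketch of that pattern for markdown sources, standard library only; the `Chunk` shape and header format here are just one way to do it, and a real version would also handle tables, code blocks, and a maximum chunk size:

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_title: str
    section_path: str   # e.g. "Returns / Refund policy"
    text: str           # chunk body with the context header prepended

def chunk_markdown(doc_title: str, markdown: str) -> list[Chunk]:
    """Split on heading boundaries; prepend a header naming the document,
    the section, and its parent sections to every chunk."""
    chunks: list[Chunk] = []
    heading_stack: list[tuple[int, str]] = []   # (level, title) of enclosing headings
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if not text:
            return
        path = " / ".join(title for _, title in heading_stack)
        header = f"[{doc_title} / {path}]" if path else f"[{doc_title}]"
        chunks.append(Chunk(doc_title, path, f"{header}\n{text}"))

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.+)", line)
        if match:
            flush()
            body.clear()
            level = len(match.group(1))
            # keep only headings that enclose this one, then push it
            while heading_stack and heading_stack[-1][0] >= level:
                heading_stack.pop()
            heading_stack.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```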
Hybrid retrieval is the default, not an optimization
Pure vector search dominates academic discussion of RAG and is rarely the right choice for production. The reason is mundane: vector search is great at semantic similarity and bad at lexical precision. A user searching for a specific error code, a specific person’s name, or a specific function reference wants an exact match. Vector search will pull semantically adjacent results that miss the literal string the user asked about.
Production RAG is hybrid: BM25 (or equivalent lexical search) for precision, vector search for recall, and a re-ranker on the merged result set. We use this stack on essentially every project now:
Query → [BM25 top-50] + [Vector top-50] → Dedup → Cross-encoder re-rank → Top-10 → LLM
The cross-encoder re-ranker is the part teams skip and shouldn’t. A small re-ranker model (Cohere Rerank, BGE Reranker, etc.) reorders 100 candidates against the query and surfaces the best 10. The latency cost is 100–300ms; the quality lift is significant. Without re-ranking, you’re either returning too few results (recall problem) or too many (precision problem). Re-ranking lets you cast a wide net cheaply and surface the best of it.
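As a rough sketch of that stack, here's an in-memory version using rank_bm25 and sentence-transformers; the model names are examples, and a production system would back this with a real index and vector store rather than brute-force scoring:

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder

class HybridRetriever:
    """BM25 for lexical precision, vectors for semantic recall,
    a cross-encoder to pick the best of the merged candidates."""

    def __init__(self, chunks: list[str]):
        self.chunks = chunks
        self.bm25 = BM25Okapi([c.lower().split() for c in chunks])
        self.embedder = SentenceTransformer("all-MiniLM-L6-v2")    # example embedder
        self.reranker = CrossEncoder("BAAI/bge-reranker-base")     # example re-ranker
        self.embeddings = self.embedder.encode(chunks, normalize_embeddings=True)

    def retrieve(self, query: str, k_candidates: int = 50, k_final: int = 10) -> list[str]:
        # Lexical top-k: catches error codes, names, exact identifiers
        bm25_scores = self.bm25.get_scores(query.lower().split())
        bm25_ids = np.argsort(bm25_scores)[::-1][:k_candidates]

        # Vector top-k: catches paraphrased and vaguely worded queries
        q_emb = self.embedder.encode([query], normalize_embeddings=True)[0]
        vec_ids = np.argsort(self.embeddings @ q_emb)[::-1][:k_candidates]

        # Merge, dedup (preserving order), and re-rank against the query
        candidates = list(dict.fromkeys([*bm25_ids, *vec_ids]))
        scores = self.reranker.predict([(query, self.chunks[i]) for i in candidates])
        ranked = sorted(zip(scores, candidates), reverse=True)
        return [self.chunks[i] for _, i in ranked[:k_final]]
```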
Metadata filtering is underrated
A user asking “what’s our refund policy” doesn’t want results from the marketing handbook, even if they’re semantically similar. They want results from the customer support knowledge base.
Metadata filtering — restricting search to a subset of the corpus before semantic similarity — solves this. Every chunk should carry metadata: source document, section type, audience, recency, owner. Queries should propagate intent into filters. “What’s our refund policy?” filters to support docs. “Who manages the integrations team?” filters to org charts. “What did we ship last quarter?” filters to internal updates with date > X.
The pattern that works: a small classifier in front of the retrieval call that maps the query to a metadata filter. This can be a cheap model call (Haiku or 4o-mini), a rules engine, or both in series. The filter narrows the search space by an order of magnitude before the embedding similarity does its work.
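A sketch of that front-end, with a handful of rules and a hook for the fallback model call; the rule table and the `llm_classify` callable are illustrative, not a fixed schema:

```python
from typing import Callable

# Rules cover the obvious intents cheaply; the model call handles the long tail.
FILTER_RULES = [
    (("refund", "return policy", "warranty"),    {"source": "support_kb"}),
    (("who manages", "org chart", "reports to"), {"source": "org_charts"}),
    (("ship", "launched", "last quarter"),       {"source": "internal_updates"}),
]

def query_to_filter(query: str,
                    llm_classify: Callable[[str], dict] | None = None) -> dict:
    """Map a query to a metadata filter applied before similarity search."""
    q = query.lower()
    for keywords, metadata_filter in FILTER_RULES:
        if any(kw in q for kw in keywords):
            return metadata_filter
    if llm_classify is not None:
        # Stand-in for a cheap model call that returns a filter dict,
        # e.g. {"source": "internal_updates", "after": "2024-10-01"}
        return llm_classify(query)
    return {}   # no filter: search the whole corpus
```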
Evals are not optional
Most RAG teams ship without an eval harness, watch a few queries on launch day, and call it good. This works approximately as well as shipping any production system without tests.
A useful RAG eval has two layers.
Retrieval evaluation measures whether the right chunks come back for a given query. You need a labeled dataset of queries paired with the chunks that should be retrieved. Hand-labeling 100 queries takes a day and pays back forever. Standard metrics: precision@k, recall@k, MRR. You’re looking for trends over time as you change the system; absolute numbers matter less than direction.
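The retrieval layer needs very little machinery. A sketch of the metrics over a labeled set, where `retrieve` is whatever function returns chunk IDs for a query, best first:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for c in retrieved[:k] if c in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for c in retrieved[:k] if c in relevant) / max(len(relevant), 1)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate_retrieval(labeled: list[dict], retrieve, k: int = 10) -> dict:
    """labeled: [{"query": ..., "relevant_chunk_ids": [...]}, ...]"""
    rows = []
    for example in labeled:
        retrieved = retrieve(example["query"])          # chunk IDs, best first
        relevant = set(example["relevant_chunk_ids"])
        rows.append((precision_at_k(retrieved, relevant, k),
                     recall_at_k(retrieved, relevant, k),
                     mrr(retrieved, relevant)))
    n = len(rows) or 1
    return {f"precision@{k}": sum(r[0] for r in rows) / n,
            f"recall@{k}": sum(r[1] for r in rows) / n,
            "mrr": sum(r[2] for r in rows) / n}
```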
Generation evaluation measures whether the LLM produces good answers given the retrieved context. The question to optimize for: given the retrieved chunks, did the model produce a faithful, complete answer? “Faithful” means it didn’t make things up that aren’t in the context. “Complete” means it didn’t omit relevant information. Both can be evaluated with another LLM as a judge — cheaper than human evaluation, more reproducible, and surprisingly reliable when you constrain the judge to specific criteria.
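A sketch of the judge layer, with the criteria pinned down in the prompt; `judge_llm` stands in for whatever cheap model call you use, and the rubric here is one we made up for illustration:

```python
import json

JUDGE_PROMPT = """You are grading a RAG answer against its retrieved context.

Context:
{context}

Question: {question}
Answer: {answer}

Score each criterion from 1 to 5:
- faithfulness: every claim in the answer is supported by the context
- completeness: the answer covers the relevant information in the context

Respond as JSON: {{"faithfulness": n, "completeness": n, "explanation": "..."}}"""

def judge_answer(question: str, context: str, answer: str, judge_llm) -> dict:
    """judge_llm is any callable that takes a prompt string and returns
    the judge model's text response."""
    raw = judge_llm(JUDGE_PROMPT.format(
        context=context, question=question, answer=answer))
    return json.loads(raw)
```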
We run both layers on every change to the system. A change that improves retrieval but tanks generation is not a win. A change that doesn’t move the eval at all is not a win either; it’s deferred risk.
Citations are the safety mechanism
Hallucinations in RAG happen for two reasons. Either the right context wasn’t retrieved (retrieval failure), or it was retrieved and the model ignored it (generation failure). The mitigation that addresses both: require the model to cite which chunk each claim came from.
Practically, this means structuring the LLM call so that each retrieved chunk has an ID, and the model is asked to answer with inline citations like [chunk-3]. After generation, you verify every claim has a citation and every citation references a chunk that was actually in context. Claims without citations are flagged. Citations to nonexistent chunks are flagged.
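The verification step is mechanical. A sketch, assuming citations in the `[chunk-N]` format described above and a naive sentence split:

```python
import re

CITATION = re.compile(r"\[chunk-(\d+)\]")

def verify_citations(answer: str, context_chunk_ids: set[int]) -> dict:
    """Flag sentences with no citation and citations to chunks that were
    never in the context given to the model."""
    uncited, invalid = [], []
    # Naive sentence split; good enough to flag cases for review
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        cited = [int(m) for m in CITATION.findall(sentence)]
        if sentence and not cited:
            uncited.append(sentence)
        invalid += [c for c in cited if c not in context_chunk_ids]
    return {"uncited_sentences": uncited,
            "invalid_citations": invalid,
            "passed": not uncited and not invalid}
```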
This is not an airtight guarantee against hallucination, but it does two useful things. It makes the model less likely to hallucinate (it knows it has to cite). And it surfaces, at runtime, when the model is making things up — which lets you handle those cases (refuse, add a disclaimer, escalate) rather than serving them confidently.
What scales linearly and what doesn’t
A theme: the components of a RAG system have very different scaling behaviors.
Linear with corpus size: storage cost, embedding generation cost, retrieval latency. These all grow predictably and can be planned for.
Super-linear with corpus size: noise in retrieval. As your corpus grows, the number of plausibly relevant chunks for a typical query grows faster than the number of actually relevant ones. A retrieval setup that works at 10k documents may not work at 1M. The fix is more aggressive metadata filtering and better re-ranking, not just bigger embeddings.
Sub-linear if done right: maintenance cost. Spending an extra week on retrieval evaluation up front means you can swap embedding models, change chunking strategies, and tune re-rankers without breaking the production system. Teams that skip this step pay for it later as a fear-driven inability to change anything.
Where to start
If you’re building a RAG system today and want it to survive contact with real users:
- Get your chunking right before you pick an embedding model. Walk a representative document and ask whether each chunk is self-contained.
- Use hybrid retrieval (BM25 + vector + re-ranker) from day one. Don’t ship with pure vector and “optimize later.”
- Build a 100-query labeled eval set before you ship. Run it on every change.
- Require citations in the generation step. Verify them at runtime.
- Ship a small slice — one document type, one user role, one set of intents — before generalizing. RAG breaks at the boundaries between intents; narrow ones are tractable.
If you’re scoping a RAG system or a knowledge product and want a second pass on the architecture, we do this work.