RAG architecture: from PDFs to production
Hybrid retrieval, re-ranking, chunking strategies, eval metrics that actually matter. The mistakes we made in our first five RAG deployments — so you don't have to.
The naive RAG that everyone ships first
Pick a vector DB. Chunk your docs at 512 tokens. Embed with whatever model is trending. Retrieve top-5. Stuff into the prompt. Hit a model. Ship.
It works for the demo. It fails the moment a user asks something that requires combining information from three different documents, or when the document set grows past 100k chunks, or when the user phrases the query in a way the embeddings don't match.
We've shipped 28 RAG systems in production. Here's what production-grade actually looks like.
Stage 1 · Document loading and parsing
The boring infrastructure that decides everything that follows. PDFs are the worst — they look fine to humans but contain tables that turn into garbled prose, footers that infect every chunk, and OCR'd images masquerading as text.
- PDFs: use a layout-aware parser (we use unstructured + Marker for hard cases). Detect tables, render them to Markdown. Extract images and OCR them separately.
- Word docs: easier — python-docx preserves structure. Style hints (Heading 1, Heading 2) are gold for hierarchical chunking.
- HTML / wikis: strip nav and footer, keep the article body. Confluence and Notion have decent exports.
- Code: tree-sitter to parse by function/class boundaries instead of line counts. Each chunk should be syntactically meaningful.
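A minimal parsing sketch with unstructured, assuming a recent release where partition_pdf supports the hi_res strategy and table inference; the file path is illustrative, and the Markdown conversion of tables happens downstream:

```python
# Sketch: layout-aware PDF parsing with unstructured (assumes unstructured[pdf] is installed).
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="handbook.pdf",        # illustrative path
    strategy="hi_res",              # layout-aware parsing instead of fast text extraction
    infer_table_structure=True,     # keep table structure instead of garbled prose
)

blocks = []
for el in elements:
    if el.category in ("Header", "Footer"):
        continue                    # drop the boilerplate that otherwise infects every chunk
    if el.category == "Table":
        # unstructured exposes the table as HTML; convert to Markdown before chunking
        blocks.append(el.metadata.text_as_html)
    else:
        blocks.append(el.text)
```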
Stage 2 · Chunking — where most teams lose 20% accuracy
Naive token-window chunking (every 512 tokens) breaks paragraphs mid-sentence and loses context. Better strategies:
- Semantic chunking: split on paragraph boundaries, then merge until you hit your target size. Preserve at least one full sentence per chunk.
- Hierarchical chunking: for long documents, store both full-section chunks (for summary queries) and paragraph-level chunks (for specific facts). Retrieve the right granularity for the question.
- Overlap matters: 50-100 token overlap between chunks lets us recover from cuts that happen mid-thought. Disable for code (overlap creates duplicate matches).
- Context windows: prepend each chunk with its document title and section heading. The embedder sees more context; retrieval improves.
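A minimal sketch of the paragraph-merge strategy above. The 512-token target and 75-token overlap are illustrative, and whitespace word counts stand in for a real tokenizer:

```python
# Sketch: semantic chunking. Split on paragraphs, merge up to a target size, keep overlap between chunks.
def chunk(text: str, target_tokens: int = 512, overlap_tokens: int = 75) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []

    def size(parts):  # crude token count; swap in your tokenizer
        return sum(len(p.split()) for p in parts)

    for para in paragraphs:
        if current and size(current) + len(para.split()) > target_tokens:
            chunks.append("\n\n".join(current))
            # carry the tail of the previous chunk forward so mid-thought cuts are recoverable
            tail = " ".join("\n\n".join(current).split()[-overlap_tokens:])
            current = [tail]
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

For code, drop the overlap carry-over and split on tree-sitter boundaries instead of paragraphs.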
Stage 3 · Embeddings — pick once, change rarely
For English-only knowledge bases, BGE-large or E5-large rank near the top of the MTEB leaderboard and run cheaply. For multilingual (English + Indic languages), multilingual-e5-large or BGE-M3. Language match matters: an English query against a Hindi document set with the wrong embedder retrieves nothing.
One mistake we made early: changing embedders mid-project. The vectors aren't comparable across models. Re-embed everything or commit to one model and stay.
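A minimal embedding sketch with sentence-transformers and BGE-large; the model name and the query instruction prefix are the ones we believe the BGE model card recommends, and the sample chunks are illustrative. Normalizing makes cosine similarity a plain dot product:

```python
# Sketch: embed documents and queries with one fixed model (BAAI/bge-large-en-v1.5 assumed).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

chunks = [
    {"title": "Billing FAQ", "section": "Cancellations", "text": "To cancel, open Settings > Billing ..."},
    {"title": "Billing FAQ", "section": "Refunds", "text": "Refunds are processed within 5 business days ..."},
]

# Prepend title + section heading so the embedder sees context (the Stage 2 advice).
doc_texts = [f"{c['title']} > {c['section']}\n{c['text']}" for c in chunks]
doc_vecs = model.encode(doc_texts, normalize_embeddings=True)

# BGE v1.5 recommends an instruction prefix on the query side only.
query = "Represent this sentence for searching relevant passages: how do I cancel my subscription?"
query_vec = model.encode(query, normalize_embeddings=True)

scores = doc_vecs @ query_vec   # cosine similarity, since the vectors are normalized
```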
Stage 4 · Hybrid retrieval (vector + BM25)
Pure vector search misses queries with specific keywords (account numbers, product SKUs, function names). Pure BM25 misses paraphrases ("how do I cancel" vs "subscription termination"). Run both, fuse the scores with reciprocal rank fusion (RRF):
score(doc) = Σ_i 1 / (k + rank_i(doc))
where rank_i(doc) is the document's position in ranking i (vector or BM25) and k = 60 works well in practice. RRF rewards documents that appear high in any ranking. We see +8 to +14% recall@10 from hybrid vs pure vector on our customer benchmarks.
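A minimal RRF sketch over the two ranked lists, with k = 60 as above; the document ids are illustrative, and in practice the two rankings come from your vector index and BM25 index:

```python
# Sketch: reciprocal rank fusion over a vector ranking and a BM25 ranking.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_42", "doc_7", "doc_19"]   # ids from the vector index, best first
bm25_hits = ["doc_7", "doc_88", "doc_42"]     # ids from BM25, best first
fused = rrf([vector_hits, bm25_hits])         # doc_7 and doc_42 rise to the top
```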
Stage 5 · Re-ranking — the highest-ROI optimization
After retrieving the top 50 candidates from hybrid search, run them through a cross-encoder (BGE-reranker-large is our default). Cross-encoders read the query and document together — much higher accuracy than dual-encoder embeddings, but too slow to run at scale. Re-ranking the top 50 → top 5 takes about 80ms on a single A100 and dramatically improves answer quality.
Skip the re-ranker only if you're optimizing for sub-200ms p95 first-token latency.
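A minimal re-ranking sketch with a sentence-transformers CrossEncoder; BAAI/bge-reranker-large is the checkpoint we assume, and the candidates would be the ~50 fused hits from Stage 4 rather than the two strings shown here:

```python
# Sketch: cross-encoder re-ranking of the fused candidates down to top-5.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large", max_length=512)

query = "how do I cancel my subscription?"
candidates = [
    "To cancel, open Settings > Billing ...",
    "Refunds are processed within 5 business days ...",
]

# The cross-encoder scores each (query, passage) pair jointly, unlike a dual encoder.
scores = reranker.predict([(query, passage) for passage in candidates])
top5 = [p for _, p in sorted(zip(scores, candidates), reverse=True)[:5]]
```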
Stage 6 · Generation — prompt structure that works
The prompt has four sections, in this order:
- System role: describe what the assistant is and isn't allowed to do.
- Retrieved context: the top-k chunks, each with a citation marker.
- Response rules: "Cite the source for every claim. If the context doesn't answer the question, say 'I don't have that information.' Do not invent."
- The user query.
Putting context before rules helps models like Llama-3 stay grounded. Putting rules at the end keeps them in the model's most-recent attention.
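A minimal prompt-assembly sketch in the four-section order above; the exact wording of the system role and rules is illustrative:

```python
# Sketch: assemble the prompt. System role, cited context, response rules, then the user query.
def build_prompt(query: str, chunks: list[dict]) -> str:
    context = "\n\n".join(
        f"[{i + 1}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks)
    )
    return (
        "You are a support assistant. Answer only from the provided context.\n\n"
        f"Context:\n{context}\n\n"
        "Rules: Cite the source marker [n] for every claim. If the context does not "
        "answer the question, say 'I don't have that information.' Do not invent.\n\n"
        f"Question: {query}"
    )

prompt = build_prompt(
    "How do I cancel my subscription?",
    [{"source": "billing-faq.md", "text": "To cancel, open Settings > Billing ..."}],
)
```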
Stage 7 · Evaluation — without it, you're flying blind
The eval set is the heart of any production RAG system. Build it on day one. Aim for 100+ Q&A pairs covering:
- Single-document factual questions
- Cross-document synthesis questions
- Edge cases your data doesn't cover (the model should refuse)
- Adversarial prompts (jailbreaks, prompt injection)
Score on retrieval accuracy (did the right chunk make top-5?), answer correctness (does the response match the expected answer?), citation quality (does every claim have a source?), and refusal correctness (does it know when to say I don't know?). RAGAS is a decent open-source toolkit for automated scoring.
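A minimal sketch of the retrieval-accuracy half of that scoring (did a gold chunk make top-5). Answer correctness and citation quality need an LLM judge or RAGAS on top, and the eval-set format shown here is our own assumption:

```python
# Sketch: recall@5 over a hand-built eval set of {question, gold_chunk_ids} pairs.
def retrieval_recall_at_k(eval_set: list[dict], retrieve, k: int = 5) -> float:
    hits = 0
    for example in eval_set:
        retrieved_ids = {c["id"] for c in retrieve(example["question"])[:k]}
        if retrieved_ids & set(example["gold_chunk_ids"]):  # any gold chunk in top-k counts
            hits += 1
    return hits / len(eval_set)

# eval_set = [{"question": "How do I cancel?", "gold_chunk_ids": ["billing-faq-03"]}, ...]
# print(retrieval_recall_at_k(eval_set, retrieve=my_hybrid_retriever))
```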
Mistakes from our first 5 deployments
- Chunking PDFs blindly. The first version of one customer's index had headers and footers in every chunk — destroying retrieval until we filtered them out.
- No re-ranker. We thought it was optional. After A/B testing, it added 11% accuracy at 80ms cost. Always worth it.
- Forgetting metadata filters. Customers asking about a specific year, region, or product line need filtered retrieval. Store metadata, expose filter params in the API.
- Same chunk size for everything. Code, prose, and tabular data need different chunking. Treat them differently.
- Stale indexes. The first system we shipped had no re-indexing pipeline. Six months later, half the answers were based on outdated docs. Build the freshness pipeline on day one.
The production checklist
- Hybrid retrieval (vector + BM25 + RRF)
- Cross-encoder re-ranking on top-k
- Metadata filters exposed in the API
- Citation markers in every response
- Eval set of 100+ Q&A pairs in CI
- Re-indexing pipeline with delta updates
- Drift monitoring on retrieval scores and response quality
- Audit log of every query (who, what, when, what was retrieved)
Related: Build your private LLM in 14 days