RAG architecture: from PDFs to production
Hybrid retrieval, re-ranking, chunking strategies, eval metrics that actually matter. The mistakes we made in our first five RAG deployments — so you don't have to.
The naive RAG that everyone ships first
Pick a vector DB. Chunk your docs at 512 tokens. Embed with whatever model is trending. Retrieve top-5. Stuff into the prompt. Hit a model. Ship.
It works for the demo. It fails the moment a user asks something that requires combining information from three different documents, or when the document set grows past 100k chunks, or when the user phrases the query in a way the embeddings don't match.
We've shipped 28 RAG systems in production. Here's what production-grade actually looks like.
Stage 1 · Document loading and parsing
The boring infrastructure that decides everything that follows. PDFs are the worst — they look fine to humans but contain tables that turn into garbled prose, footers that infect every chunk, and OCR'd images masquerading as text.
- PDFs: use a layout-aware parser (we use unstructured + Marker for hard cases). Detect tables, render them to Markdown. Extract images and OCR them separately.
- Word docs: easier — python-docx preserves structure. Style hints (Heading 1, Heading 2) are gold for hierarchical chunking.
- HTML / wikis: strip nav and footer, keep the article body. Confluence and Notion have decent exports.
- Code: tree-sitter to parse by function/class boundaries instead of line counts. Each chunk should be syntactically meaningful.
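A minimal parsing sketch with unstructured, assuming a recent release where partition_pdf supports the hi_res strategy and table inference; the file path is illustrative, and the Markdown conversion of tables happens downstream:

```python
# Sketch: layout-aware PDF parsing with unstructured (assumes unstructured[pdf] is installed).
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="handbook.pdf",        # illustrative path
    strategy="hi_res",              # layout-aware parsing instead of fast text extraction
    infer_table_structure=True,     # keep table structure instead of garbled prose
)

blocks = []
for el in elements:
    if el.category in ("Header", "Footer"):
        continue                    # drop the boilerplate that otherwise infects every chunk
    if el.category == "Table":
        # unstructured exposes the table as HTML; convert to Markdown before chunking
        blocks.append(el.metadata.text_as_html)
    else:
        blocks.append(el.text)
```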
Stage 2 · Chunking — where most teams lose 20% accuracy
Naive token-window chunking (every 512 tokens) breaks paragraphs mid-sentence and loses context. Better strategies:
- Semantic chunking: split on paragraph boundaries, then merge until you hit your target size. Preserve at least one full sentence per chunk.
- Hierarchical chunking: for long documents, store both full-section chunks (for summary queries) and paragraph-level chunks (for specific facts). Retrieve the right granularity for the question.
- Overlap matters: 50-100 token overlap between chunks lets us recover from cuts that happen mid-thought. Disable for code (overlap creates duplicate matches).
- Context windows: prepend each chunk with its document title and section heading. The embedder sees more context; retrieval improves.
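A minimal sketch of the paragraph-merge strategy above. The 512-token target and 75-token overlap are illustrative, and whitespace word counts stand in for a real tokenizer:

```python
# Sketch: semantic chunking. Split on paragraphs, merge up to a target size, keep overlap between chunks.
def chunk(text: str, target_tokens: int = 512, overlap_tokens: int = 75) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []

    def size(parts):  # crude token count; swap in your tokenizer
        return sum(len(p.split()) for p in parts)

    for para in paragraphs:
        if current and size(current) + len(para.split()) > target_tokens:
            chunks.append("\n\n".join(current))
            # carry the tail of the previous chunk forward so mid-thought cuts are recoverable
            tail = " ".join("\n\n".join(current).split()[-overlap_tokens:])
            current = [tail]
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

For code, drop the overlap carry-over and split on tree-sitter boundaries instead of paragraphs.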
Stage 3 · Embeddings — pick once, change rarely
For English-only knowledge bases, BGE-large or E5-large rank near the top of the MTEB leaderboard and run cheaply. For multilingual (English + Indic languages), multilingual-e5-large or BGE-M3. Language match matters: an English query against a Hindi document set with the wrong embedder retrieves nothing.
One mistake we made early: changing embedders mid-project. The vectors aren't comparable across models. Re-embed everything or commit to one model and stay.
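A minimal embedding sketch with sentence-transformers and BGE-large; the model name and the query instruction prefix are the ones we believe the BGE model card recommends, and the sample chunks are illustrative. Normalizing makes cosine similarity a plain dot product:

```python
# Sketch: embed documents and queries with one fixed model (BAAI/bge-large-en-v1.5 assumed).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

chunks = [
    {"title": "Billing FAQ", "section": "Cancellations", "text": "To cancel, open Settings > Billing ..."},
    {"title": "Billing FAQ", "section": "Refunds", "text": "Refunds are processed within 5 business days ..."},
]

# Prepend title + section heading so the embedder sees context (the Stage 2 advice).
doc_texts = [f"{c['title']} > {c['section']}\n{c['text']}" for c in chunks]
doc_vecs = model.encode(doc_texts, normalize_embeddings=True)

# BGE v1.5 recommends an instruction prefix on the query side only.
query = "Represent this sentence for searching relevant passages: how do I cancel my subscription?"
query_vec = model.encode(query, normalize_embeddings=True)

scores = doc_vecs @ query_vec   # cosine similarity, since the vectors are normalized
```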
Stage 4 · Hybrid retrieval (vector + BM25)
Pure vector search misses queries with specific keywords (account numbers, product SKUs, function names). Pure BM25 misses paraphrases ("how do I cancel" vs "subscription termination"). Run both, fuse the scores with reciprocal rank fusion (RRF):
score(doc) = Σ_i 1 / (k + rank_i(doc))
where rank_i(doc) is the document's position in ranking i (vector or BM25) and k = 60 works well in practice. RRF rewards documents that appear high in any ranking. We see +8 to +14% recall@10 from hybrid vs pure vector on our customer benchmarks.
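A minimal RRF sketch over the two ranked lists, with k = 60 as above; the document ids are illustrative, and in practice the two rankings come from your vector index and BM25 index:

```python
# Sketch: reciprocal rank fusion over a vector ranking and a BM25 ranking.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_42", "doc_7", "doc_19"]   # ids from the vector index, best first
bm25_hits = ["doc_7", "doc_88", "doc_42"]     # ids from BM25, best first
fused = rrf([vector_hits, bm25_hits])         # doc_7 and doc_42 rise to the top
```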
Stage 5 · Re-ranking — the highest-ROI optimization
After retrieving the top 50 candidates from hybrid search, run them through a cross-encoder (BGE-reranker-large is our default). Cross-encoders read the query and document together — much higher accuracy than dual-encoder embeddings, but too slow to run at scale. Re-ranking the top 50 → top 5 takes about 80ms on a single A100 and dramatically improves answer quality.
Skip the re-ranker only if you're optimizing for sub-200ms p95 first-token latency.
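A minimal re-ranking sketch with a sentence-transformers CrossEncoder; BAAI/bge-reranker-large is the checkpoint we assume, and the candidates would be the ~50 fused hits from Stage 4 rather than the two strings shown here:

```python
# Sketch: cross-encoder re-ranking of the fused candidates down to top-5.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large", max_length=512)

query = "how do I cancel my subscription?"
candidates = [
    "To cancel, open Settings > Billing ...",
    "Refunds are processed within 5 business days ...",
]

# The cross-encoder scores each (query, passage) pair jointly, unlike a dual encoder.
scores = reranker.predict([(query, passage) for passage in candidates])
top5 = [p for _, p in sorted(zip(scores, candidates), reverse=True)[:5]]
```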
Stage 6 · Generation — prompt structure that works
The prompt has four sections, in this order:
- System role: describe what the assistant is and isn't allowed to do.
- Retrieved context: the top-k chunks, each with a citation marker.
- Response rules: "Cite the source for every claim. If the context doesn't answer the question, say 'I don't have that information.' Do not invent."
- The user query.
Putting context before rules helps models like Llama-3 stay grounded. Putting rules at the end keeps them in the model's most-recent attention.
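A minimal prompt-assembly sketch in the four-section order above; the exact wording of the system role and rules is illustrative:

```python
# Sketch: assemble the prompt. System role, cited context, response rules, then the user query.
def build_prompt(query: str, chunks: list[dict]) -> str:
    context = "\n\n".join(
        f"[{i + 1}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks)
    )
    return (
        "You are a support assistant. Answer only from the provided context.\n\n"
        f"Context:\n{context}\n\n"
        "Rules: Cite the source marker [n] for every claim. If the context does not "
        "answer the question, say 'I don't have that information.' Do not invent.\n\n"
        f"Question: {query}"
    )

prompt = build_prompt(
    "How do I cancel my subscription?",
    [{"source": "billing-faq.md", "text": "To cancel, open Settings > Billing ..."}],
)
```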
Stage 7 · Evaluation — without it, you're flying blind
The eval set is the heart of any production RAG system. Build it on day one. Aim for 100+ Q&A pairs covering:
- Single-document factual questions
- Cross-document synthesis questions
- Edge cases your data doesn't cover (the model should refuse)
- Adversarial prompts (jailbreaks, prompt injection)
Score on retrieval accuracy (did the right chunk make top-5?), answer correctness (does the response match the expected answer?), citation quality (does every claim have a source?), and refusal correctness (does it know when to say I don't know?). RAGAS is a decent open-source toolkit for automated scoring.
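A minimal sketch of the retrieval-accuracy half of that scoring (did a gold chunk make top-5). Answer correctness and citation quality need an LLM judge or RAGAS on top, and the eval-set format shown here is our own assumption:

```python
# Sketch: recall@5 over a hand-built eval set of {question, gold_chunk_ids} pairs.
def retrieval_recall_at_k(eval_set: list[dict], retrieve, k: int = 5) -> float:
    hits = 0
    for example in eval_set:
        retrieved_ids = {c["id"] for c in retrieve(example["question"])[:k]}
        if retrieved_ids & set(example["gold_chunk_ids"]):  # any gold chunk in top-k counts
            hits += 1
    return hits / len(eval_set)

# eval_set = [{"question": "How do I cancel?", "gold_chunk_ids": ["billing-faq-03"]}, ...]
# print(retrieval_recall_at_k(eval_set, retrieve=my_hybrid_retriever))
```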
Mistakes from our first 5 deployments
- Chunking PDFs blindly. The first version of one customer's index had headers and footers in every chunk — destroying retrieval until we filtered them out.
- No re-ranker. We thought it was optional. After A/B testing, it added 11% accuracy at 80ms cost. Always worth it.
- Forgetting metadata filters. Customers asking about a specific year, region, or product line need filtered retrieval. Store metadata, expose filter params in the API.
- Same chunk size for everything. Code, prose, and tabular data need different chunking. Treat them differently.
- Stale indexes. The first system we shipped had no re-indexing pipeline. Six months later, half the answers were based on outdated docs. Build the freshness pipeline on day one.
The production checklist
- Hybrid retrieval (vector + BM25 + RRF)
- Cross-encoder re-ranking on top-k
- Metadata filters exposed in the API
- Citation markers in every response
- Eval set of 100+ Q&A pairs in CI
- Re-indexing pipeline with delta updates
- Drift monitoring on retrieval scores and response quality
- Audit log of every query (who, what, when, what was retrieved)
Related: Build your private LLM in 14 days