
RAG architecture: from PDFs to production

Hybrid retrieval, re-ranking, chunking strategies, eval metrics that actually matter. The mistakes we made in our first five RAG deployments — so you don't have to.

Anjali Sharma · Head of ML · 15 min read · 30 Mar 2026

The naive RAG that everyone ships first

Pick a vector DB. Chunk your docs at 512 tokens. Embed with whatever model is trending. Retrieve top-5. Stuff into the prompt. Hit a model. Ship.

It works for the demo. It fails the moment a user asks something that requires combining information from three different documents, or when the document set grows past 100k chunks, or when the user phrases the query in a way the embeddings don't match.

We've shipped 28 RAG systems in production. Here's what production-grade actually looks like.

Stage 1 · Document loading and parsing

The boring infrastructure that decides everything that follows. PDFs are the worst — they look fine to humans but contain tables that turn into garbled prose, footers that infect every chunk, and OCR'd images masquerading as text.
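As a small illustration of the footer problem, here is a minimal sketch that extracts page text with pypdf and drops any line that repeats across most pages, which is almost always a header or footer. The library choice and the file path are assumptions; a real pipeline also needs table extraction and OCR handling.

    # A minimal sketch: strip repeated header/footer lines from parsed PDF pages.
    # Assumes pypdf is installed; "report.pdf" is a placeholder path.
    from collections import Counter
    from pypdf import PdfReader

    def load_pdf_pages(path):
        reader = PdfReader(path)
        return [page.extract_text() or "" for page in reader.pages]

    def strip_repeated_lines(pages, min_fraction=0.6):
        # Lines appearing on most pages are almost always headers or footers.
        counts = Counter()
        for text in pages:
            for line in set(text.splitlines()):
                counts[line.strip()] += 1
        threshold = max(2, int(min_fraction * len(pages)))
        boilerplate = {line for line, n in counts.items() if line and n >= threshold}
        return [
            "\n".join(l for l in text.splitlines() if l.strip() not in boilerplate)
            for text in pages
        ]

    pages = strip_repeated_lines(load_pdf_pages("report.pdf"))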

Stage 2 · Chunking — where most teams lose 20% accuracy

Naive token-window chunking (a new chunk every 512 tokens) breaks paragraphs mid-sentence and loses context. Better strategies split on document structure first (headings, paragraphs, sentences) and fall back to a token window only when a single unit is too long.
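Below is a minimal sketch of that kind of structure-aware splitter: it tries paragraph breaks first, then lines, sentences, and words, and only hard-splits on a token window when nothing else fits. Counting whitespace-separated words as tokens is a simplification; swap in a real tokenizer in practice.

    # A minimal sketch of recursive, structure-aware chunking.
    # Word count stands in for token count to keep the example dependency-free.
    def split_recursive(text, max_tokens=512, separators=("\n\n", "\n", ". ", " ")):
        if len(text.split()) <= max_tokens:
            return [text]
        for sep in separators:
            parts = text.split(sep)
            if len(parts) == 1:
                continue  # separator not present; try a finer one
            chunks, current = [], ""
            for part in parts:
                candidate = f"{current}{sep}{part}" if current else part
                if len(candidate.split()) > max_tokens and current:
                    # Flush the accumulated chunk; recurse in case it is still too big.
                    chunks.extend(split_recursive(current, max_tokens, separators))
                    current = part
                else:
                    current = candidate
            if current:
                chunks.extend(split_recursive(current, max_tokens, separators))
            return chunks
        # No separator left: hard-split on the token window as a last resort.
        words = text.split()
        return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]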

Stage 3 · Embeddings — pick once, change rarely

For English-only knowledge bases, BGE-large or E5-large sit near the top of the MTEB leaderboard and run cheaply. For multilingual corpora (English + Indic languages), use multilingual-e5-large or BGE-M3. Language match matters: an English query against a Hindi document set with the wrong embedder retrieves nothing.

One mistake we made early: changing embedders mid-project. The vectors aren't comparable across models. Re-embed everything or commit to one model and stay.
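For reference, a minimal embedding sketch with sentence-transformers; the model names are the ones mentioned above, and the batch size and normalization flag are illustrative defaults rather than benchmark-backed settings. E5-family models expect "query:" / "passage:" prefixes, so check the model card before indexing.

    # A minimal sketch of batch embedding with sentence-transformers.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-large-en-v1.5")  # or "intfloat/multilingual-e5-large"
    chunks = ["First chunk of text...", "Second chunk of text..."]
    vectors = model.encode(chunks, normalize_embeddings=True, batch_size=64)
    # Store vectors alongside chunk IDs and metadata; if you ever switch models,
    # re-embed the entire corpus before serving queries.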

Stage 4 · Hybrid retrieval (vector + BM25)

Pure vector search misses queries with specific keywords (account numbers, product SKUs, function names). Pure BM25 misses paraphrases ("how do I cancel" vs "subscription termination"). Run both, fuse the scores with reciprocal rank fusion (RRF):

score(doc) = Σ_i 1 / (k + rank_i(doc))

where rank_i(doc) is the document's rank in list i (one list per retriever) and k = 60 works well in practice. RRF rewards documents that appear high in any ranking. We see +8 to +14% recall@10 from hybrid vs pure vector on our customer benchmarks.
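In code, RRF is only a few lines. The doc IDs below are placeholders; each input list is one retriever's ranking, best first.

    # A minimal sketch of reciprocal rank fusion over ranked lists of doc IDs.
    def rrf(rankings, k=60):
        scores = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    vector_hits = ["doc_7", "doc_2", "doc_9"]  # from the vector index
    bm25_hits = ["doc_2", "doc_4", "doc_7"]    # from BM25 / keyword search
    fused = rrf([vector_hits, bm25_hits])      # doc_2 and doc_7 rise to the top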

Stage 5 · Re-ranking — the highest-ROI optimization

After retrieving the top 50 candidates from hybrid search, run them through a cross-encoder (BGE-reranker-large is our default). Cross-encoders read the query and document together — much higher accuracy than dual-encoder embeddings, but too slow to run at scale. Re-ranking the top 50 → top 5 takes about 80ms on a single A100 and dramatically improves answer quality.

Skip the re-ranker only if you're optimizing for sub-200ms p95 first-token latency.
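A minimal re-ranking sketch using the CrossEncoder wrapper from sentence-transformers; the query and candidate texts are invented for illustration.

    # A minimal sketch of cross-encoder re-ranking: score (query, doc) pairs jointly.
    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("BAAI/bge-reranker-large")

    def rerank(query, candidates, top_k=5):
        scores = reranker.predict([(query, doc) for doc in candidates])
        ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
        return [doc for doc, _ in ranked[:top_k]]

    candidates = [
        "To cancel your subscription, open Billing and choose 'Cancel plan'.",
        "Our refund policy covers annual plans cancelled within 30 days.",
    ]
    best = rerank("how do I cancel my subscription", candidates)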

Stage 6 · Generation — prompt structure that works

The prompt has four sections, in this order:

  1. System role: describe what the assistant is and isn't allowed to do.
  2. Retrieved context: the top-k chunks, each with a citation marker.
  3. Response rules: "Cite the source for every claim. If the context doesn't answer the question, say 'I don't have that information.' Do not invent."
  4. The user query.

Putting context before rules helps models like Llama-3 stay grounded. Putting rules at the end keeps them in the model's most-recent attention.
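A minimal sketch of assembling that four-section prompt as a single string; the wording of the role and rules is illustrative rather than a fixed template, and the chunks are assumed to carry a source field for citations.

    # A minimal sketch of the four-section prompt: role, context, rules, query.
    def build_prompt(system_role, chunks, user_query):
        context = "\n\n".join(
            f"[{i + 1}] (source: {c['source']})\n{c['text']}" for i, c in enumerate(chunks)
        )
        rules = (
            "Cite the source for every claim using the [n] markers. "
            "If the context doesn't answer the question, say 'I don't have that information.' "
            "Do not invent."
        )
        return f"{system_role}\n\nContext:\n{context}\n\nRules: {rules}\n\nQuestion: {user_query}"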

Stage 7 · Evaluation — without it, you're flying blind

The eval set is the heart of any production RAG system. Build it on day one. Aim for 100+ Q&A pairs covering straightforward single-document lookups, questions that span multiple documents, keyword-heavy and paraphrased queries, and questions the corpus cannot answer.

Score on retrieval accuracy (did the right chunk make top-5?), answer correctness (does the response match the expected answer?), citation quality (does every claim have a source?), and refusal correctness (does it know when to say I don't know?). RAGAS is a decent open-source toolkit for automated scoring.
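A minimal sketch of the first of those scores, retrieval accuracy; the retrieve callable and the eval-set fields are assumptions about the shape of your own pipeline.

    # A minimal sketch of retrieval accuracy: did the gold chunk make the top-k?
    def retrieval_hit_rate(eval_set, retrieve, k=5):
        hits = 0
        for example in eval_set:
            retrieved_ids = [c["id"] for c in retrieve(example["question"], top_k=k)]
            if example["gold_chunk_id"] in retrieved_ids:
                hits += 1
        return hits / len(eval_set)

    eval_set = [
        {"question": "How do I cancel my subscription?", "gold_chunk_id": "billing_014"},
        # ... 100+ pairs, written on day one
    ]
    # hit_rate = retrieval_hit_rate(eval_set, retrieve=my_retriever, k=5)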

Mistakes from our first 5 deployments

  1. Chunking PDFs blindly. The first version of one customer's index had headers and footers in every chunk — destroying retrieval until we filtered them out.
  2. No re-ranker. We thought it was optional. After A/B testing, it added 11% accuracy at 80ms cost. Always worth it.
  3. Forgetting metadata filters. Customers asking about a specific year, region, or product line need filtered retrieval. Store metadata, expose filter params in the API (a sketch follows this list).
  4. Same chunk size for everything. Code, prose, and tabular data need different chunking. Treat them differently.
  5. Stale indexes. The first system we shipped had no re-indexing pipeline. Six months later, half the answers were based on outdated docs. Build the freshness pipeline on day one.
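For mistake 3, a minimal sketch of filtered retrieval; search_fn stands in for the hybrid search from Stage 4, and in production you would usually push the filter into the vector DB's index query rather than post-filtering like this.

    # A minimal sketch of metadata-filtered retrieval: over-fetch, then filter.
    def retrieve_filtered(search_fn, query, top_k=5, filters=None):
        candidates = search_fn(query, top_k=200)
        if filters:
            candidates = [
                c for c in candidates
                if all(c["metadata"].get(key) == value for key, value in filters.items())
            ]
        return candidates[:top_k]

    # retrieve_filtered(hybrid_search, "Q3 2025 revenue in APAC",
    #                   filters={"year": 2025, "region": "APAC"})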

The production checklist

  1. Parsing that handles tables, headers/footers, and scanned pages.
  2. Structure-aware chunking, tuned separately for code, prose, and tabular data.
  3. One embedding model matched to the corpus languages; re-embed everything if it ever changes.
  4. Hybrid retrieval (vector + BM25) fused with RRF.
  5. Cross-encoder re-ranking of the top candidates.
  6. A grounded prompt with citation markers and explicit refusal rules.
  7. Metadata stored per chunk and filter params exposed in the API.
  8. An eval set of 100+ Q&A pairs, scored on every change.
  9. A re-indexing pipeline so answers never go stale.