Tutorials | 2026-05-13

RAG in Production: What Nobody Tells You

Hard-won lessons from deploying retrieval-augmented generation at scale.

By AI Signal Editorial

Retrieval-augmented generation looks deceptively simple in a notebook. You embed your documents, you store them in a vector database, you retrieve the top-k chunks for each query, you stuff them into a prompt, and you ship. In production, every one of those steps becomes its own quietly painful subsystem.

Chunking is product work, not infra

The default of "split by 512 tokens with 50-token overlap" is fine for a demo and almost never optimal for a real corpus. Section-aware chunking that respects document structure — headers, lists, tables — consistently outperforms flat splitters. For long-form technical content, anchor chunks to semantic units (a function, a definition, an example) and let chunk size float between 200 and 1,200 tokens. This is content modelling, not infrastructure, and it should sit close to the team that owns the source material.
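As a rough sketch of what section-aware chunking could look like, the function below cuts a document at headings, packs whole paragraphs into chunks up to a ceiling, and folds undersized chunks into their predecessor. It assumes Markdown-style `#` headings and blank-line-separated paragraphs, and it approximates tokens by counting whitespace-separated words. `Chunk`, `countTokens`, and `chunkBySection` are illustrative names, not part of any library.

```typescript
// Illustrative sketch: names are hypothetical, and "tokens" are approximated
// by whitespace-separated words rather than a real tokenizer.
interface Chunk {
  heading: string; // nearest heading above the chunk, "" if none
  text: string;
  tokens: number;
}

const countTokens = (s: string): number => s.split(/\s+/).filter(Boolean).length;

function chunkBySection(doc: string, minTokens = 200, maxTokens = 1200): Chunk[] {
  // 1. Cut at headings so no chunk straddles two sections.
  let section = { heading: "", body: [] as string[] };
  const sections = [section];
  for (const line of doc.split("\n")) {
    if (/^#{1,6}\s/.test(line)) {
      section = { heading: line.replace(/^#+\s*/, "").trim(), body: [] };
      sections.push(section);
    } else {
      section.body.push(line);
    }
  }

  // 2. Inside a section, pack whole paragraphs up to the ceiling.
  //    A single paragraph longer than maxTokens is kept whole.
  const chunks: Chunk[] = [];
  for (const { heading, body } of sections) {
    const paragraphs = body.join("\n").split(/\n\s*\n/).map(p => p.trim()).filter(Boolean);
    let parts: string[] = [];
    let size = 0;
    for (const p of paragraphs) {
      const n = countTokens(p);
      if (size > 0 && size + n > maxTokens) {
        chunks.push({ heading, text: parts.join("\n\n"), tokens: size });
        parts = [];
        size = 0;
      }
      parts.push(p);
      size += n;
    }
    if (size > 0) chunks.push({ heading, text: parts.join("\n\n"), tokens: size });
  }

  // 3. Fold undersized chunks into their predecessor when the result still fits.
  const merged: Chunk[] = [];
  for (const c of chunks) {
    const prev = merged[merged.length - 1];
    if (prev && c.tokens < minTokens && prev.tokens + c.tokens <= maxTokens) {
      prev.text += "\n\n" + c.text;
      prev.tokens += c.tokens;
    } else {
      merged.push(c);
    }
  }
  return merged;
}
```

Keeping the heading on each chunk lets you prepend it to the text at embedding time, which is one cheap way to carry document structure into retrieval. For a real corpus the paragraph splitter is the part to replace with whatever your semantic unit is: a function, a definition, an example.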

Re-ranking is where quality actually lives

A bi-encoder retrieves fast but ranks crudely, because it embeds the query and each document separately and never reads them together. Adding a cross-encoder re-ranker over the top 50 results (even a small one) typically lifts answer quality by more than any embedding-model swap. The cost is one extra GPU call per query and a modest increase in latency. If you are still serving raw vector-search hits to your LLM in 2026, you are leaving most of the quality on the table.
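A minimal sketch of the re-ranking step, under stated assumptions: the cross-encoder is injected as a function that takes the query plus a batch of passages and returns one relevance score per passage. `Hit`, `PairScorer`, `topByScore`, and `rerank` are hypothetical names chosen for illustration and do not mirror any specific vector store or model-serving API.

```typescript
// Illustrative sketch: `Hit` and `PairScorer` are hypothetical shapes,
// not the API of any particular vector store or model server.
interface Hit {
  id: string;
  text: string;
  vectorScore: number; // similarity reported by the vector index
}

// A cross-encoder scores the query and each passage jointly, one score per passage.
type PairScorer = (query: string, passages: string[]) => Promise<number[]>;

// Pure step: order hits by the re-ranker's scores and keep the best few.
function topByScore(hits: Hit[], scores: number[], keep: number): Hit[] {
  if (scores.length !== hits.length) {
    throw new Error(`expected ${hits.length} scores, got ${scores.length}`);
  }
  return hits
    .map((hit, i) => ({ hit, score: scores[i] }))
    .sort((a, b) => b.score - a.score) // vector order is discarded; ties keep input order
    .slice(0, keep)
    .map(r => r.hit);
}

async function rerank(query: string, hits: Hit[], score: PairScorer, keep = 6): Promise<Hit[]> {
  if (hits.length === 0) return [];
  // All candidates go out in a single batched request to the scorer.
  const scores = await score(query, hits.map(h => h.text));
  return topByScore(hits, scores, keep);
}
```

Separating the pure ordering step from the model call keeps the logic testable without a GPU, and makes it easy to log both the vector rank and the re-ranked position of each hit when you are debugging a bad answer.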

Evaluation will save you

The most important thing we built was not the pipeline but the eval harness: a few hundred curated query/answer pairs, a rubric for graded judging, and a CI job that fails the build if the mean grade drops. Without that, every prompt tweak is a coin flip. With it, you can iterate confidently on chunking, embeddings, and re-ranking, and you can finally answer the question "is this change better" with something other than vibes.
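The core of such a harness can be sketched in a few functions. This is an illustration and not the harness described above: the 1 to 5 rubric scale, the baseline, and the tolerance are assumed values, and `answer` and `judge` are hypothetical hooks standing in for your own pipeline and grader.

```typescript
// Illustrative sketch: the 1-5 rubric scale, baseline, and tolerance are
// assumptions, and `answer`/`judge` stand in for your own pipeline and grader.
interface EvalCase {
  query: string;
  reference: string; // curated reference answer
}

interface Graded extends EvalCase {
  answer: string;
  grade: number; // rubric grade, assumed here to run from 1 (wrong) to 5 (ideal)
}

function meanGrade(results: Graded[]): number {
  if (results.length === 0) throw new Error("eval set is empty");
  return results.reduce((sum, r) => sum + r.grade, 0) / results.length;
}

// Decide whether the build should pass, and surface the weakest cases for review.
function checkRegression(results: Graded[], baseline: number, tolerance = 0.05) {
  const mean = meanGrade(results);
  const worst = [...results].sort((a, b) => a.grade - b.grade).slice(0, 5);
  return { mean, passed: mean >= baseline - tolerance, worst };
}

async function runEval(
  cases: EvalCase[],
  answer: (query: string) => Promise<string>,
  judge: (c: EvalCase, answer: string) => Promise<number>,
): Promise<Graded[]> {
  const results: Graded[] = [];
  for (const c of cases) {
    const a = await answer(c.query); // full RAG pipeline under test
    results.push({ ...c, answer: a, grade: await judge(c, a) });
  }
  return results;
}
```

In a CI job you would call `runEval`, pass the results to `checkRegression` with the mean grade from the main branch as `baseline`, and exit non-zero when `passed` is false. The small tolerance absorbs judge noise so that the gate trips on real regressions and not on run-to-run jitter.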

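// Example: a minimal RAG retrieval step (vectorStore and crossEncoder are placeholder clients)
async function buildContext(query, queryEmbedding) {
  // Over-fetch 50 candidates from the vector index, then re-score them against the query.
  const hits = await vectorStore.search(queryEmbedding, { topK: 50 });
  const reranked = await crossEncoder.rerank(query, hits);
  // Keep the six best chunks, separated by a visible delimiter.
  return reranked.slice(0, 6).map(h => h.text).join("\n---\n");
}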