The "hello world" of RAG (Retrieval-Augmented Generation) takes about 20 minutes to build: chunk your documents, generate embeddings, store them in a vector database, and query with semantic search. The result works impressively in demos — until you deploy it to production and discover that it hallucinates confidently, retrieves irrelevant chunks, fails on multi-step questions, and gives different answers to the same question depending on how it's phrased.
The gap between tutorial RAG and production RAG is enormous. After building RAG systems for enterprise clients handling thousands of queries per day, we've identified the patterns that separate systems that work from systems that don't. This guide covers the hard problems that tutorials skip.
Chunking: The Foundation Everyone Gets Wrong
Chunking is how you split documents into pieces for embedding. The naive approach — split by character count every 500 characters — destroys context boundaries, splits sentences in half, and separates related information. Better chunking strategies make or break RAG quality.
Semantic chunking: Instead of splitting at fixed character counts, split at natural boundaries: paragraphs, sections, or topic changes. Use an LLM or sentence embedding model to detect topic shifts and chunk accordingly.
# Semantic chunking using sentence embeddings
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_chunk(text: str, max_chunk_size: int = 1500,
                   similarity_threshold: float = 0.75) -> list[str]:
    """Split text into semantically coherent chunks."""
    # Naive sentence split; use a proper sentence tokenizer in production
    sentences = text.split('. ')
    embeddings = model.encode(sentences)
    chunks = []
    current_chunk = [sentences[0]]
    current_embedding = embeddings[0]
    for i in range(1, len(sentences)):
        # Cosine similarity between the running chunk embedding and the next sentence
        similarity = np.dot(current_embedding, embeddings[i]) / (
            np.linalg.norm(current_embedding) * np.linalg.norm(embeddings[i])
        )
        current_text = '. '.join(current_chunk)
        if similarity > similarity_threshold and len(current_text) < max_chunk_size:
            current_chunk.append(sentences[i])
            # Update running average embedding
            current_embedding = np.mean(
                [current_embedding, embeddings[i]], axis=0
            )
        else:
            chunks.append('. '.join(current_chunk) + '.')
            current_chunk = [sentences[i]]
            current_embedding = embeddings[i]
    if current_chunk:
        chunks.append('. '.join(current_chunk) + '.')
    return chunks
Hierarchical chunking: Create chunks at multiple levels of granularity. Store both paragraph-level chunks and section-level chunks. Use small chunks for precise retrieval, then expand to the parent section for context when generating answers. This "small-to-big" strategy gives you the precision of small chunks with the context of large chunks.
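The small-to-big expansion step can be sketched as a lookup from retrieved child chunks to their parent sections. The `parent_of` mapping and dict-based stores below are illustrative assumptions, not a specific library's API; in practice the parent ID would live in each chunk's vector-store metadata.

```python
# Small-to-big retrieval: match on small chunks, return parent sections.
# `parent_of` maps each small-chunk ID to its enclosing section ID; both
# stores are plain dicts here to illustrate the pattern.
def small_to_big(matched_chunk_ids: list[str],
                 parent_of: dict[str, str],
                 sections: dict[str, str]) -> list[str]:
    """Expand retrieved chunk IDs to their parent sections, deduplicated."""
    seen = set()
    contexts = []
    for chunk_id in matched_chunk_ids:
        section_id = parent_of[chunk_id]
        if section_id not in seen:  # avoid sending the same section twice
            seen.add(section_id)
            contexts.append(sections[section_id])
    return contexts
```

Deduplication matters here: several small chunks often share one parent, and sending the same section to the LLM twice wastes context window.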
Overlap: Always include 10-20% overlap between consecutive chunks. This ensures that information at chunk boundaries isn't lost. Without overlap, a question whose answer spans two chunks will get incomplete retrieval.
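A minimal sketch of fixed-size chunking with overlap, assuming character-based sizes; the 75-character overlap (15% of a 500-character chunk) is an example value in the range the text recommends:

```python
# Fixed-size chunking with overlap: each chunk repeats the tail of the
# previous one, so a fact spanning a boundary appears intact in at least
# one chunk. 75 / 500 = 15% overlap, within the 10-20% range above.
def chunk_with_overlap(text: str, chunk_size: int = 500,
                       overlap: int = 75) -> list[str]:
    step = chunk_size - overlap  # advance by less than a full chunk
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```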
Hybrid Search: Combining Vector and Keyword Search
Pure semantic search has a critical weakness: it doesn't handle exact matches well. If a user asks "What is error code ERR-4523?", semantic search might return chunks about error handling in general rather than the specific chunk mentioning ERR-4523. Keyword search (BM25) excels at exact matches but fails at understanding intent.
Production RAG systems use hybrid search: run both a vector similarity search and a BM25 keyword search, then combine the results. This captures both semantic meaning and exact term matches.
# Hybrid search with Reciprocal Rank Fusion (RRF)
# Assumes embed(), vector_db, bm25_index, and get_document() are provided
# by your embedding model, vector store, and keyword index respectively.
def hybrid_search(query: str, top_k: int = 10) -> list[dict]:
    # Semantic search
    query_embedding = embed(query)
    vector_results = vector_db.search(query_embedding, limit=top_k * 2)
    # Keyword search (BM25)
    keyword_results = bm25_index.search(query, limit=top_k * 2)
    # Reciprocal Rank Fusion: each list contributes 1 / (k + rank + 1)
    scores = {}
    k = 60  # RRF constant
    for rank, doc in enumerate(vector_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank + 1)
    for rank, doc in enumerate(keyword_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank + 1)
    # Sort by combined score and return top_k
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [get_document(doc_id) for doc_id, _ in ranked[:top_k]]
Re-ranking: Improving Retrieval Precision
Even with hybrid search, the top results aren't always the most relevant. Bi-encoder models (used for embedding) are fast but sacrifice accuracy because they encode the query and document independently. Cross-encoder models (re-rankers) are slower but more accurate because they process the query and document together, allowing for deeper semantic comparison.
The production pattern: retrieve 20-50 candidates with fast hybrid search, then re-rank the top candidates with a cross-encoder to get the best 3-5 results for the LLM. This two-stage approach combines the speed of bi-encoders with the accuracy of cross-encoders.
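The two-stage pattern can be sketched as follows. Here `score_pair` stands in for a cross-encoder's relevance score (for example, the `predict` method of a sentence-transformers `CrossEncoder`); it is an assumed interface, injected so the reranking logic stays independent of any one model.

```python
from typing import Callable

# Two-stage retrieval: broad candidate fetch (done upstream), then
# cross-encoder re-ranking of the candidates. `score_pair(query, doc)`
# stands in for a cross-encoder relevance score.
def retrieve_then_rerank(query: str, candidates: list[str],
                         score_pair: Callable[[str, str], float],
                         final_k: int = 3) -> list[str]:
    # Score every (query, candidate) pair jointly, then keep the best few
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:final_k]]
```

Because the cross-encoder sees only 20-50 candidates rather than the whole corpus, its higher per-pair cost stays bounded regardless of knowledge-base size.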
Query Transformation: Handling Complex Questions
Users don't always ask simple, well-formed questions. They ask multi-part questions ("What's our refund policy and how does it compare to last year's?"), vague questions ("How does the thing work?"), and follow-up questions that reference earlier context ("What about for enterprise customers?"). Passing these directly to retrieval gives poor results.
Query transformation techniques:
Query decomposition: Use an LLM to break complex questions into sub-questions, retrieve for each sub-question separately, and combine the results. "What's our refund policy and how does it compare to last year's?" becomes two queries: "current refund policy" and "refund policy changes 2025 vs 2026."
Hypothetical Document Embedding (HyDE): Instead of embedding the question directly, use an LLM to generate a hypothetical answer, embed that answer, and use it for retrieval. The intuition is that the hypothetical answer is semantically closer to the actual document than the question is.
Query expansion: Generate multiple phrasings of the same question and retrieve for all of them. "How to configure SMTP?" expands to: "SMTP server setup guide", "email sending configuration", "outbound mail server settings."
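Decomposition and expansion share the same retrieval skeleton: issue several queries, then fuse the ranked results. The sketch below fuses with reciprocal rank fusion, mirroring the hybrid-search section; `search(query)` is an assumed interface returning document IDs in relevance order, and the sub-questions or expansions themselves would come from an LLM call, omitted here.

```python
from typing import Callable

# Multi-query retrieval: run each rewritten query, then fuse the ranked
# lists with reciprocal rank fusion so documents matched by several
# phrasings rise to the top.
def multi_query_retrieve(queries: list[str],
                         search: Callable[[str], list[str]],
                         top_k: int = 5, k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for query in queries:
        for rank, doc_id in enumerate(search(query)):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank + 1)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]
```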
Evaluation: Measuring RAG Quality
You cannot improve what you cannot measure. RAG evaluation requires metrics at two levels: retrieval quality (are we finding the right documents?) and generation quality (is the answer correct, complete, and grounded in the retrieved context?).
Retrieval metrics: Use a test set of question-answer pairs with known relevant documents. Measure recall@k (what percentage of relevant documents appear in the top k results?) and MRR (Mean Reciprocal Rank — how high is the first relevant document?). Aim for recall@5 above 0.9.
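Both retrieval metrics are a few lines each, computed per query over the labeled test set (the function and argument names here are illustrative):

```python
# Retrieval metrics for one labeled query: `retrieved` is the ranked list
# of document IDs the system returned, `relevant` the known-relevant IDs.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result; 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1 / rank
    return 0.0

# MRR is the mean of reciprocal_rank over all queries in the test set.
```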
Generation metrics: Faithfulness (does the answer only contain information from the retrieved context? Higher = less hallucination), Answer relevancy (does the answer address the question?), and Completeness (does the answer include all relevant information from the context?). Tools like RAGAS (RAG Assessment) automate these measurements.
Preventing Hallucinations in Production
The most dangerous failure mode in RAG is confident hallucination — the LLM generates an answer that sounds authoritative but is completely fabricated. This happens when the retrieved context doesn't contain the answer, but the LLM fills in the gap from its training data or invents something plausible.
Prevention strategies:
Explicit grounding instruction: Your system prompt must explicitly state: "Answer ONLY based on the provided context. If the context does not contain enough information to answer the question, say 'I don't have enough information to answer this question.' Never make up information."
Citation requirement: Require the LLM to cite which chunk(s) each claim comes from. "According to [Document: Refund Policy v2.3, Section 4], refunds are processed within 5-7 business days." If the LLM can't cite a source, it's likely hallucinating.
Confidence scoring: After generating an answer, use a separate LLM call to evaluate whether the answer is fully supported by the context. If confidence is below a threshold, either flag for human review or respond with "I'm not confident in this answer — please contact support."
Retrieval-gated responses: Before generating an answer, check the similarity scores of retrieved chunks. If the best match is below a threshold (e.g., cosine similarity < 0.75), the knowledge base probably doesn't contain the answer. Return a "no relevant information found" response instead of letting the LLM guess.
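The retrieval gate is the simplest of these to wire in, since it runs before any generation happens. A minimal sketch, assuming the best chunk's cosine similarity is already computed upstream and using the 0.75 threshold from the example above (tune it on your own data):

```python
from typing import Callable

NO_ANSWER = "No relevant information found in the knowledge base."

# Retrieval gate: refuse to answer when the best retrieved chunk is a
# weak match, instead of letting the LLM guess. `generate` is the
# (deferred) LLM call, only invoked when the gate passes.
def gated_answer(best_similarity: float,
                 generate: Callable[[], str],
                 threshold: float = 0.75) -> str:
    if best_similarity < threshold:
        return NO_ANSWER  # knowledge base likely lacks the answer
    return generate()
```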
Architecture for Scale
A production RAG system serving 1,000+ queries per day needs: a vector database with low-latency search (Qdrant, Weaviate, Pinecone, or pgvector), a caching layer for frequent queries (Redis with query hash as key), an async ingestion pipeline for new documents (process, chunk, embed, and store without blocking queries), and monitoring for retrieval quality (log query-context-answer triples for offline evaluation).
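The caching layer can be sketched as a hash of the normalized query mapped to the final answer. In production the store would be Redis with a TTL; a plain dict stands in below, and the key prefix and normalization are illustrative choices.

```python
import hashlib
from typing import Callable

# Query cache keyed by a hash of the normalized query text. Swap the dict
# for Redis (with an expiry) in production; the pattern is identical.
_cache: dict[str, str] = {}

def query_key(query: str) -> str:
    # Normalize case and whitespace so trivially different phrasings hit
    normalized = " ".join(query.lower().split())
    return "rag:" + hashlib.sha256(normalized.encode()).hexdigest()

def cached_answer(query: str, answer_fn: Callable[[str], str]) -> str:
    key = query_key(query)
    if key not in _cache:
        _cache[key] = answer_fn(query)  # miss: run the full RAG pipeline
    return _cache[key]
```

Note that only exact (normalized) repeats hit this cache; semantically similar queries with different wording still go through the full pipeline unless you add embedding-based cache lookup on top.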
ZeonEdge builds production RAG systems for enterprise knowledge bases, customer support automation, and internal documentation search. Schedule a consultation to discuss your RAG project.
Daniel Park
AI/ML Engineer focused on practical applications of machine learning in DevOps and cloud operations.