Last verified: May 16, 2026. Vendor pricing and benchmarks refreshed quarterly.
Retrieval-augmented generation is a pattern that adds a retrieval step before generation so the language model can cite documents it was never trained on. The quality of that retrieval, not the quality of the model, determines whether the answer is right. That sentence does more work than most RAG explainers let it.
RAG exists because language models have a knowledge cutoff. Training data ends at a fixed date. Private company documents never appear in training at all. When you ask a base LLM about your Q3 pricing memo or yesterday’s product update, it either confabulates or refuses. RAG fixes this by fetching the relevant content at query time and injecting it into the prompt.
The original 2020 paper by Patrick Lewis and colleagues at Meta AI (then Facebook AI Research) reported that, in human evaluation on a question-generation task, evaluators judged the retrieval-grounded model more factual than the BART baseline in 42.7% of cases and BART more factual in only 7.1%, a six-to-one margin.
RAG reduces hallucination. It does not eliminate it. Stanford researchers found that legal AI tools using RAG still hallucinate in 17 to 33 percent of cases.
A RAG pipeline has three phases: retrieve relevant chunks from an indexed corpus, augment the LLM prompt with those chunks, then generate an answer grounded in that context. The pipeline is straightforward to describe. It is not straightforward to build well.
One thing to orient you before the technical sections: the question of whether 1M-token context windows make RAG obsolete is real. For small, static corpora, they can. For large, dynamic knowledge bases, RAG still wins on cost, latency, and scalability. The section on when RAG is the wrong fix addresses exactly where that line sits.
What RAG Actually Is (in Plain English)
I have built and broken RAG systems. The thing that breaks is retrieval, not generation.
That is the insight vendor pages leave out, because vendors selling you a vector database or a cloud AI service do not benefit from telling you the hard part is not the LLM.
Here is the plain-English version: RAG is what happens when you let the model look up an answer before writing it, instead of writing from memory. The model’s training data is its parametric memory, baked into the weights. RAG gives it non-parametric memory: documents it can access at query time, outside the weights, drawn from your own corpus.
To understand why this matters, it helps to understand what language models know (and don’t know). Everything in parametric memory is frozen at training cutoff. Ask a base LLM about a policy that changed last month and it answers from what it learned before that cutoff, with full confidence. The retrieval layer is what gives the model access to anything newer or more private than its training data.
Patrick Lewis and colleagues at Meta AI introduced this approach in “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (arXiv:2005.11401, NeurIPS 2020). The result that mattered: RAG was six times more likely to produce a factual answer than BART alone. The framing was academic question answering. The pattern generalized to everything.
The “R” in RAG is doing more work than most explanations suggest. Retrieval is not a step. It is a pipeline.
How RAG Works Mechanically
The pipeline has two distinct phases that most explainers collapse into one. Keeping them separate matters for debugging and for cost accounting.
Offline phase (build time). Your source documents are split into chunks. Each chunk is converted to a dense vector by an embedding model. Those vectors are stored in a vector database. Pinecone, Weaviate, Qdrant, and pgvector are the common choices. This phase runs once, and again whenever your corpus updates. The quality of your chunking and the quality of your embedding model set the ceiling on everything that follows.
Online phase (query time). The user’s question is embedded using the same model used during ingestion. The vector database runs an approximate nearest neighbor (ANN) search and returns the top-k chunks ranked by cosine or dot-product similarity. Those chunks are inserted into the LLM prompt as a context block. The LLM generates a response grounded in what was retrieved.
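The two phases can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: `toy_embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, and a plain list with an exact scan plays the role of the vector database and its ANN search.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Offline phase: embed every chunk once and store the vectors."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Online phase: embed the question with the same function, return the top-k chunks."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


chunks = [
    "Refunds are available within 30 days of purchase.",
    "The enterprise plan includes single sign-on.",
    "Support hours are 9am to 5pm Pacific on weekdays.",
]
index = build_index(chunks)
top = retrieve(index, "Are refunds available after purchase?", k=1)
print(top)
```

In a real system, `toy_embed` would be replaced by a call to the embedding model and the sorted scan by the vector database's nearest-neighbor query. The shape of the two phases stays the same.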
Typical numbers: vector search runs in 20 to 100ms. If you add a reranker (a second-stage precision pass covered in the final section), add another 120ms. The total round trip is fast enough for conversational applications.
How the context block is positioned in the prompt matters. Prompt structure for grounded generation affects how well the model uses the retrieved content rather than reverting to its training weights.
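One possible layout, shown as a sketch: instructions first, numbered context next, the question last. The function name, the wording of the instructions, and the ordering are all illustrative assumptions, not a recommendation from any specific model vendor.

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt with instructions first, numbered context next, question last."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the numbered context below. "
        "Cite the source number for each claim. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_grounded_prompt(
    "What is the refund window?",
    ["Refunds are available within 30 days of purchase.", "Shipping takes 3 to 5 days."],
)
print(prompt)
```

Numbering the chunks is what makes the traceability described later possible: the model can point to a source number, and the application can map that number back to a document.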
Here is the part worth holding onto: the LLM call is often the simplest component to get right. The failure happens at retrieval, not at generation. Put bad chunks in and GPT-4 will write you a very well-structured wrong answer. The model cannot fix retrieval errors.
What RAG Fixes (and What It Doesn’t)
RAG solves three specific problems, and it does not fully solve the problem most people think it solves.
What RAG actually fixes:
Knowledge cutoff. Training data ends at a fixed date. RAG provides current or private knowledge without retraining the model. Updating a vector index when knowledge changes is orders of magnitude cheaper than updating model weights.
Private enterprise knowledge. A model trained on public data has never seen your contracts, your product specs, or your internal playbooks. RAG makes that material accessible at query time.
Verifiability. When the system retrieves a chunk from document X, you can tell the user exactly where the answer came from. That traceability is not possible with pure parametric generation.
What RAG does not fix:
Hallucination is reduced, not eliminated. The Stanford 17-33% legal RAG finding is the anchor data point here. If the retrieval step returns the wrong documents, the LLM still generates a fluent, confident answer from that wrong context. Wrong inputs produce well-written wrong outputs.
Two failure paths remain after you add RAG. First: retrieval misses the right document entirely. The answer was in the corpus, but the retrieval step did not surface it. Second: the model ignores or misreads the retrieved context in favor of what it learned during training, a pattern called parametric override. Both produce wrong answers.
The RAGAS evaluation framework defines faithfulness as whether the generated answer is supported by the retrieved context. A high faithfulness score confirms the model stayed within retrieved content, but it does not guarantee factual correctness if the retrieved document was itself wrong or outdated.
Understanding why LLMs hallucinate even when given correct context is a different problem than hallucination from missing information, and both can occur in a RAG system.
Chunking Is the Hidden Production Problem
When I tell people that the highest-leverage decision in a RAG system is chunking, I get blank stares. Everyone wants to talk about which vector database to use. Nobody wants to talk about where you cut the document.
Chunking is how you split source documents into the units stored in the vector index. If a relevant piece of information straddles a chunk boundary, neither of the two chunks retrieves with high confidence. This is one of the most common silent failure modes in production RAG. The answer was in the corpus. It was just split across two chunks, neither of which retrieved well.
Fixed-size chunking. Split every N tokens with overlap. Fast and deterministic. No embedding calls during ingestion. The downside is that cuts fall across sentence and paragraph boundaries without any awareness of document structure. The counterintuitive finding: NAACL 2025 research found that fixed 200-word chunks matched or outperformed semantic chunking across multiple retrieval and generation benchmarks. The added ingestion cost of fancier approaches is not automatically justified.
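A minimal sketch of fixed-size chunking with overlap. It counts whitespace-separated words rather than model tokens, which is a simplifying assumption; the function name and defaults are illustrative.

```python
def fixed_size_chunks(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
pieces = fixed_size_chunks(sample, size=4, overlap=1)
print(pieces)
```

Note what the code never looks at: sentences, paragraphs, or headings. That is the trade-off described above.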
Recursive character splitting. Applies a hierarchy of separators (paragraph breaks first, then line breaks, then spaces) and recursively splits chunks that exceed the target size. This preserves document structure better than fixed-size chunking. It is the default starting point. Recommended configuration: 400 to 512 tokens with 10 to 20 percent overlap.
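The idea can be sketched as follows. This is a simplified illustration, not the implementation of any particular library: it measures length in characters rather than tokens, and it leaves out overlap to keep the recursion easy to follow.

```python
def recursive_split(text: str, max_len: int = 200, separators=("\n\n", "\n", " ")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones only for oversized pieces."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_len:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_len:
            current = piece
        else:
            # The piece alone is too large: split it with the next, finer separator.
            chunks.extend(recursive_split(piece, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = "Intro paragraph.\n\nSecond paragraph line one.\nSecond paragraph line two.\n\nEnd."
parts = recursive_split(text, max_len=30)
print(parts)
```

Paragraph breaks are tried first, so a paragraph that fits stays whole. Only the oversized middle paragraph gets split at the line break.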
Semantic chunking. Groups sentences by embedding similarity, placing chunk boundaries where similarity between adjacent sentences drops below a threshold. The problem: it requires embedding API calls during ingestion, which is slow and expensive at scale. And as the NAACL 2025 finding shows, performance gains over recursive splitting are inconsistent enough that you should measure before committing to the cost.
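A rough sketch of the mechanism. `toy_embed` is a hypothetical stand-in that counts words; in a real pipeline that line is an embedding API call per sentence, which is where the ingestion cost comes from. The threshold value is an arbitrary assumption.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Hypothetical stand-in for an embedding API call: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Start a new chunk wherever adjacent-sentence similarity drops below the threshold."""
    if not sentences:
        return []
    vectors = [toy_embed(s) for s in sentences]  # one embedding call per sentence
    chunks, current = [], [sentences[0]]
    for prev, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, vec) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "The refund policy covers 30 days.",
    "The refund policy excludes gift cards.",
    "Our office is closed on public holidays.",
]
result = semantic_chunks(sentences)
print(result)
```

The two refund sentences share enough vocabulary to stay together. The office-hours sentence shares none, so a boundary lands before it.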
Structural chunking. For documents with explicit organization: legal filings, API documentation, financial reports. Chunk at heading, section, and table boundaries. Only split sections that exceed the maximum size. When the document’s own structure is semantically coherent, preserve it.
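As a sketch, assuming the documents use Markdown-style `#` headings (an assumption; legal filings or financial reports would need their own boundary pattern), structural chunking can be as simple as cutting at each heading line. The size check for oversized sections is omitted here.

```python
import re

HEADING = re.compile(r"^#{1,6} ")  # assumption: Markdown-style headings mark section starts


def structural_chunks(document: str) -> list[str]:
    """Start a new chunk at every heading line so each section stays intact."""
    chunks, current = [], []
    for line in document.splitlines():
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


doc = "# Authentication\nUse an API key.\n\n# Rate limits\n100 requests per minute.\nBursts are allowed."
sections = structural_chunks(doc)
print(sections)
```

Each chunk carries its own heading, which gives the embedding model a topical cue that a mid-paragraph cut would lose.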
My practical recommendation: start with recursive character splitting at 400 to 512 tokens with 10 to 20 percent overlap. Move to structural chunking for well-structured document types. Add semantic chunking only if you measure retrieval recall and it is genuinely underperforming. The ingestion cost is real, and the returns are not guaranteed.
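Measuring retrieval recall does not require heavy tooling. A minimal sketch, assuming you have a small hand-labeled set of questions with the chunk ids that should be retrieved for each (the ids and data below are made up for illustration):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the known-relevant chunk ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved ids, relevant ids) pairs, one pair per test question."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


# Hypothetical evaluation data: what the retriever returned vs. what it should have found.
runs = [
    (["c1", "c7", "c3"], {"c1", "c3"}),  # both relevant chunks retrieved
    (["c9", "c2", "c4"], {"c5"}),        # the relevant chunk was missed
]
score = mean_recall_at_k(runs, k=3)
print(score)
```

Run the same labeled set against each chunking strategy and compare the numbers before paying for the more expensive one.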
One thing vendor explainers leave out about chunking: it is a decision you make at ingestion time, and changing it means re-indexing everything. Get it roughly right early.
Embedding Model Choice Matters More Than Vector Database Choice
I have watched teams spend two weeks comparing Pinecone vs. Weaviate. That time would have been more productively spent on embedding model selection, which has a larger effect on retrieval quality.
Here is why: the embedding model quality sets a ceiling on retrieval recall. No reranker, no hybrid search strategy, and no prompt engineering can recover a relevant document that the embedding model placed far from the query in vector space. If the model encoded the document and the query such that they do not appear close to each other, the document will not be retrieved. Full stop.
The distinction that matters most, and that most operators miss: use a model trained for asymmetric search (short query vs. long document passage), not general semantic similarity. A model trained to find sentences that mean the same thing is doing a different job than a model trained to find documents that answer a question. They are not interchangeable.
Three models worth knowing:
OpenAI text-embedding-3-large: 3,072 dimensions. MTEB v2 score around 64.6. Wide enterprise deployment. The safe default for teams already in the OpenAI stack. The higher dimensionality means roughly twice the raw vector storage of 1,536-dimension models, which matters at 100M+ document scale.
Voyage AI voyage-3-large / v4-large: Purpose-built for retrieval, not general similarity. Leads retrieval-specific MTEB benchmarks at around 67.1. This is what we run at AIM.
Cohere embed-v4: MTEB around 66.3. Distinguishes “search_document” and “search_query” input types natively, which reflects an accurate understanding of the asymmetric nature of RAG retrieval.
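The storage point is simple arithmetic. A back-of-envelope sketch, assuming float32 values (4 bytes each) and counting only the raw vectors; index structures, metadata, and replication add overhead on top of this.

```python
def raw_index_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for dense vectors, assuming float32 values and no index overhead."""
    return num_vectors * dimensions * bytes_per_value


large = raw_index_bytes(100_000_000, 3072)
small = raw_index_bytes(100_000_000, 1536)
print(f"3,072 dims: {large / 1e12:.2f} TB")
print(f"1,536 dims: {small / 1e12:.2f} TB")
print(f"ratio: {large / small:.1f}x")
```

At 100M vectors that is about 1.23 TB versus 0.61 TB of raw vector data, before any index overhead.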
The operator-level conclusion: the choice between these three matters less than (a) using any model trained for asymmetric search, and (b) not mixing models. A corpus embedded with model A cannot be reliably searched with model B. Embedding model lock-in is real. Changing your embedding model after indexing means re-embedding the entire corpus from scratch. That is not a small operation.
On vector databases: at 1M vectors and typical RAG query patterns, Pinecone, Weaviate, Qdrant, and pgvector all work fine. The performance differences at that scale are negligible. The decision is mostly about your existing stack. Teams running Postgres should look at pgvector first. Teams that want no infrastructure to manage should look at Pinecone or Weaviate Cloud. Do not spend weeks on this decision. Spend that time on chunking and embedding model selection.
When RAG Is the Wrong Fix
This is the section vendor pages skip, because vendors do not benefit from telling you not to build the thing they sell.
RAG is not the right architecture for every knowledge access problem. Here are the cases where a different approach is better.
Small corpus. If your knowledge base fits in the context window, long-context injection is simpler and often more accurate. You skip the retrieval infrastructure entirely and let the LLM attend to the full corpus directly. The calculus changes depending on which model you are using, and context window differences across major models affect what fits. Gemini 1.5 and 2.0 offer 1M-token contexts. Claude and GPT-5 offer 200K+. For a corpus under roughly 500K tokens on a 1M-token model, or one that fits comfortably inside a 200K window, RAG adds infrastructure with no benefit over stuffing the context.
Real-time or dynamic data. Stock prices, live inventory, sensor readings, weather. Vector indexes are not designed for real-time updates. For these cases, Model Context Protocol (MCP) is the better pattern: the LLM calls an API directly rather than searching a vector index. RAG answers from stale indexes. When data changes faster than you can re-index, RAG answers with false confidence.
High query volume at large context. At thousands of queries per day with a 1M-token context window, long-context injection costs 100 to 1,000 times more per request than RAG. At low query volumes, context stuffing can be simpler and economically fine. At scale, RAG wins on cost.
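A rough worked example of where that multiplier comes from. Every number here is an assumption chosen for illustration: the price is hypothetical rather than a quoted vendor rate, and the sketch ignores output tokens, prompt caching, and the cost of running retrieval itself.

```python
def input_cost_per_query(prompt_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-token cost of one request at a flat per-token price."""
    return prompt_tokens / 1_000_000 * usd_per_million_tokens


PRICE = 3.00              # hypothetical USD per 1M input tokens
LONG_CONTEXT = 1_000_000  # whole corpus sent with every request
RAG_CONTEXT = 5 * 500     # assumption: 5 retrieved chunks of about 500 tokens each

long_cost = input_cost_per_query(LONG_CONTEXT, PRICE)
rag_cost = input_cost_per_query(RAG_CONTEXT, PRICE)
ratio = long_cost / rag_cost
print(f"long context: ${long_cost:.4f} per query")
print(f"RAG:          ${rag_cost:.4f} per query")
print(f"ratio:        about {ratio:.0f}x")
```

With these assumed numbers the gap is about 400x. The ratio depends only on how many tokens each approach sends, so it holds at any flat per-token price.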
Behavioral or style adaptation. RAG supplies knowledge. Fine-tuning supplies behavioral patterns, tone, and task specialization. If you want the model to write like your brand, answer in a specific format, or specialize in a narrow task domain, fine-tuning is the right tool. If you want the model to know your documents, RAG is the right tool. They are often used together: a fine-tuned model with a RAG retrieval layer.
Overengineered simplicity. A FAQ with 50 questions is better served by semantic search over a small index than by a full RAG pipeline. Overengineering is a real failure mode.
There is also the “lost in the middle” effect to consider with long-context injection: LLMs systematically underweight information positioned in the middle of long contexts. That limits the reliability of naive context stuffing at scale, even when the corpus fits the window.
Teams that account only for the obvious LLM API costs typically underestimate total RAG infrastructure cost by two to three times. The non-obvious sources (reranking model calls, re-indexing jobs, failed retrieval retry costs, monitoring infrastructure) account for 60 to 70 percent of total cost in most production deployments.
How We Use RAG at AIM
At Alameda Internet Marketing, we run RAG over client marketing documentation: product specs, historical campaigns, brand guidelines, positioning documents. The purpose is to let the LLM produce on-brand content that reflects the client’s actual positioning rather than generic training data. The model does not know our clients’ voices. The retrieval layer teaches it.
Our stack: Voyage v4-large for embedding (200M free tokens on our current plan), hybrid search combining BM25 and dense retrieval, and Voyage rerank-2.5 for the precision pass. The retrieval pattern is retrieve top-50 with hybrid search, rerank to top-5 or top-10, then pass that set to the LLM. Adding rerank-2.5 was one of the higher-return changes we made to the system. We had spent considerable time on model selection and prompt tuning before adding it. The reranker produced a larger quality improvement than most of that earlier work.
For evaluation, we use RAGAS. Not “does the answer feel right,” but measurement: faithfulness (does the answer stay within the retrieved context?), context precision (are the most relevant chunks ranked first?), and context recall (was all necessary information present in the retrieved chunks?). Faithfulness is reference-free: it needs only the question, the retrieved context, and the generated answer, not annotated ground truth. That makes it practical for production monitoring, not just development benchmarking. Context recall is the exception; it is scored against a reference answer.
The hardest production issue we hit was stale indexing. Documents were updating in the source system faster than our re-indexing jobs were running. The vector index was perpetually a few days behind. The LLM was generating confidently from outdated context. The answers looked right. The content was wrong. The fix was incremental re-indexing triggered on document updates, with index freshness monitoring keyed to expected update cadence.
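One way to decide what an incremental re-index needs to touch is to compare content hashes. The sketch below is a generic illustration, not a description of the AIM system; the function names and the in-memory dictionaries standing in for the source system and the index metadata are assumptions.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(source_docs: dict[str, str], indexed_hashes: dict[str, str]):
    """Compare source documents against the hashes recorded when they were last indexed."""
    to_embed = [doc_id for doc_id, text in source_docs.items()
                if indexed_hashes.get(doc_id) != content_hash(text)]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in source_docs]
    return to_embed, to_delete


# Hypothetical state: "pricing" changed, "new" was added, "old" was removed at the source.
source = {"pricing": "v2 pricing", "faq": "faq text", "new": "launch notes"}
indexed = {
    "pricing": content_hash("v1 pricing"),
    "faq": content_hash("faq text"),
    "old": content_hash("retired"),
}
to_embed, to_delete = plan_reindex(source, indexed)
print(to_embed, to_delete)
```

Only changed and new documents are re-embedded, and removed documents are dropped from the index so the model cannot cite them.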
If you are working through whether RAG fits your use case, that is something I help clients think through as part of our AI consulting work.
Practical RAG Decisions (the Choices That Actually Matter)
Hybrid search. Dense retrieval (vector similarity) fails on lexically specific queries: product SKUs, error codes, proper names, API method names. If the exact string is not well-represented in the embedding model’s training distribution, the resulting embedding is unreliable. BM25 (sparse retrieval) fails on semantic queries: it has no understanding of meaning, only of token overlap. Run both in parallel and fuse the results.
The standard fusion algorithm is Reciprocal Rank Fusion (RRF): each candidate scores 1/(k + rank) from each retrieval list it appears in, where k equals 60 by convention, and the per-list scores are summed. The advantage is that RRF operates on rank positions rather than raw scores, which sidesteps the incompatible scale problem (BM25 scores and cosine similarities cannot be directly summed). Weaviate has native hybrid search. Qdrant supports it via sparse vector indexes. Hybrid search is not optional in production systems that handle diverse query types.
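RRF is short enough to write out in full. A minimal sketch, with made-up document ids standing in for the BM25 and dense result lists:

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each document earns 1 / (k + rank) per list, summed across lists."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["sku-123", "doc-a", "doc-b"]   # hypothetical sparse results, best first
dense_hits = ["doc-a", "doc-c", "sku-123"]  # hypothetical dense results, best first
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that appear in both lists rise to the top, and no raw BM25 or cosine score is ever compared directly.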
Reranking. Initial retrieval is optimized for speed and recall: get a broad set of candidates fast. It is not optimized for precision. A bi-encoder compresses each document into a single vector independently of the query, which necessarily loses cross-attention information. A cross-encoder reranker takes each (query, candidate document) pair and scores them together, with the query and document going through the transformer simultaneously. Slower, but more accurate.
The mature stack pulls a wide candidate set with hybrid search, then narrows to the top-5 or top-10 with a cross-encoder rerank before the LLM sees anything. Most teams add the reranker last and find it produces a larger quality jump than the model and prompt tuning they did beforehand. The latency hit lands around 120ms. Worth it. Cohere rerank-3 and Voyage rerank-2.5 are the current API leaders.
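The second stage has a simple shape, sketched below. `toy_pair_score` is a hypothetical stand-in for a cross-encoder; a real reranker would be a model or API call that reads the query and the document together.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Second stage: score every (query, candidate) pair jointly and keep the best top_n."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def toy_pair_score(query: str, doc: str) -> float:
    """Hypothetical stand-in for a cross-encoder: fraction of query words found in the doc."""
    q_words = set(query.lower().split())
    return len(q_words & set(doc.lower().split())) / len(q_words) if q_words else 0.0


candidates = [
    "billing and invoices",
    "password rules for admins",
    "reset your password from the login page",
]
best = rerank("reset password", candidates, toy_pair_score, top_n=2)
print(best)
```

The scorer runs once per candidate, which is why reranking is applied to a few dozen candidates rather than the whole corpus.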
RAGAS for evaluation. If you are not measuring, you are guessing. Faithfulness tells you whether the generated answer is grounded in retrieved context. Context precision tells you whether the most relevant chunks were ranked first. Context recall tells you whether all necessary information was present in what was retrieved. Faithfulness needs no ground truth annotations, which makes it practical for ongoing production monitoring rather than one-time benchmarking. Context recall is scored against a reference answer, so it needs a labeled evaluation set.
GraphRAG and beyond. Microsoft’s GraphRAG (arXiv:2404.16130) builds a knowledge graph from the corpus and pre-generates community summaries, outperforming standard RAG on global sensemaking questions at higher indexing cost. A 2026 Nature Communications paper on Hyper-RAG uses hypergraphs to capture beyond-pairwise entity relationships, with strong results in clinical and legal domains where multiple entities interact in complex ways. HopRAG (ACL 2025 Findings, arXiv:2502.12442) addresses multi-hop retrieval, where answering a question requires synthesizing information from multiple independent documents through iterative retrieval chains. These are production-viable for the right use cases. None replace standard RAG for typical enterprise knowledge base applications.
There is no “done” state in a RAG system. Index freshness, retrieval quality monitoring, and chunk boundary tuning are ongoing. Build the measurement infrastructure before you build the retrieval pipeline.
Frequently Asked Questions
Does ChatGPT use RAG?
ChatGPT uses retrieval when its web browsing tool is active, but the base model answers from training weights by default. Browsing is invoked when the user requests current information or the model determines it is needed. Operators building applications on the GPT-4 or GPT-5 API implement RAG themselves at the application layer. It is an architectural choice made by the application builder, not a feature baked into the model.
What is the difference between RAG and fine-tuning?
Fine-tuning modifies model weights using training examples, changing what the model knows and how it behaves. RAG leaves the model unchanged and supplies external knowledge at query time. Use fine-tuning for behavioral adaptation (tone, format, task specialization) and stable domain knowledge that does not need to be traceable to a specific source document. Use RAG for current, private, or frequently changing knowledge where verifiability matters. The two approaches are often combined: a fine-tuned model with a RAG retrieval layer on top.
Do I still need RAG with 1M context windows?
For small, static corpora under roughly 500K tokens, long-context injection avoids the retrieval infrastructure entirely and can be more accurate. RAG wins when the corpus exceeds the context window, when query volume makes per-request long-context costs prohibitive (the difference at thousands of queries per day can reach 100 to 1,000 times the cost of RAG), or when latency requirements are tight. The “lost in the middle” effect also limits naive context stuffing: LLMs systematically underweight information placed in the middle of long contexts, making full-corpus injection less reliable than it appears on first consideration.
What is the best chunking strategy?
There is no universal winner. Recursive character splitting at 400 to 512 tokens with 10 to 20 percent overlap is the sensible default. For document types with strong structural cues (legal filings, API docs, financial filings), switch to structural chunking that respects sections and headings. Semantic chunking only earns its ingestion cost if you have measured recall and it is the actual bottleneck. The right strategy depends on what the documents look like; the wrong move is to assume “smarter” chunking is always better.
What is the best vector database for RAG?
At 1M vectors and typical RAG query patterns, performance differences between major options are small. The decision is mostly about your existing stack: teams running Postgres should look at pgvector; teams that want no infrastructure to manage should consider Pinecone or Weaviate Cloud; teams that want the fastest open-source option should look at Qdrant. Invest more time in embedding model selection. That choice has more impact on retrieval quality than which vector database you pick.
Does RAG eliminate hallucination?
No. Grounding shifts the hallucination distribution down; it does not zero it out. The Stanford legal-AI study put the residual rate between 17 and 33 percent even with retrieval in place. Two failure paths drive that floor: retrieval whiffs (the answer was in the corpus, but the search step did not surface it), or the model favors its training weights over the retrieved context (parametric override). RAGAS’s faithfulness metric measures whether the generated answer is supported by what was retrieved. A high score confirms grounding but says nothing about whether the retrieved document was accurate to begin with.
What to Read Next
- What language models know (and don’t know) covers the parametric memory model that RAG extends
- Why LLMs hallucinate even when given correct context explains the failure modes that persist after retrieval
- ChatGPT vs. Claude vs. Gemini vs. Grok covers context window differences that affect when long-context injection beats RAG
About the author: Ross Taylor runs Alameda Internet Marketing, an AI-native agency that uses AI tooling daily on real client accounts. He has written previously on how LLMs work and why hallucination persists. This article reflects hands-on experience building and running RAG systems in production.