Retrieval-Augmented Generation: Grounding Language Models in External Knowledge
RAG (Lewis et al., NeurIPS 2020): retriever encodes query q and passages d_i as dense vectors; top-k retrieved by maximum inner product search; RAG-Token achieved 44.5% Exact Match on NaturalQuestions vs 29.8% closed-book T5; DPR retriever top-20 accuracy 78.4% vs BM25 59.1%.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| RAG-Token NaturalQuestions accuracy | 44.5% | Exact Match | Lewis et al. (2020): vs 29.8% closed-book T5-3B baseline; +14.7 percentage points absolute |
| DPR top-20 retrieval accuracy | 78.4% | % top-20 | Karpukhin et al. (2020): Dense Passage Retrieval on NQ test set; vs BM25 59.1% |
| Fusion-in-Decoder retrieval depth | k = 100 | passages | Izacard & Grave (2021): FiD scales to 100 retrieved passages; 67.6% EM on NaturalQuestions |
| Context window overhead per passage | ~130 tokens per passage | tokens | 100-word passages ≈ 130 tokens; k=5 adds ~650 tokens to context window budget |
Retrieval-Augmented Generation (RAG) augments language model generation with dynamically retrieved documents from an external knowledge source. By combining parametric memory (model weights) with non-parametric memory (a retrieval corpus), RAG enables factual generation without requiring all world knowledge to be encoded in parameters during training.
Architecture
RAG consists of two components:
Retriever encodes queries and passages as dense vectors. At inference, it retrieves the top-k passages by maximum inner product search (MIPS):
MIPS_k(q) = top-k_{i ∈ {1..N}} E_q(q) · E_d(d_i)
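The retrieval step can be sketched with plain NumPy; the index size, embedding dimension, and random vectors below are illustrative, and a production system would use an approximate MIPS library rather than a brute-force scan:

```python
import numpy as np

def mips_top_k(query_vec: np.ndarray, passage_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k passages with the highest inner product with the query."""
    scores = passage_vecs @ query_vec              # (N,) inner products E_q(q) · E_d(d_i)
    # argpartition finds the k largest in O(N) without fully sorting all N scores
    top_k = np.argpartition(scores, -k)[-k:]
    return top_k[np.argsort(scores[top_k])[::-1]]  # order the k winners by descending score

rng = np.random.default_rng(0)
passages = rng.normal(size=(10_000, 768))          # toy index: 10k passages, 768-dim vectors
query = rng.normal(size=768)
print(mips_top_k(query, passages, k=5))
```

Exhaustive dot products are fine at this scale; real RAG deployments replace the scan with an approximate index so retrieval stays sublinear in corpus size.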
Generator is a sequence-to-sequence language model conditioned on the query and each retrieved passage, with the output distribution marginalized over passages:
P(y|x) ≈ Σ_{i=1}^{k} p(d_i|x) · P(y|x, d_i)
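The marginalization above is a weighted mixture of per-passage likelihoods. A toy numeric check, with k = 3 and entirely illustrative probabilities (not values from the paper):

```python
import numpy as np

# p_doc[i] is the retriever posterior p(d_i|x); p_seq_given_doc[i] is the
# generator's sequence likelihood P(y|x, d_i). Numbers are illustrative only.
p_doc = np.array([0.6, 0.3, 0.1])
p_seq_given_doc = np.array([0.020, 0.008, 0.001])

# P(y|x) = sum_i p(d_i|x) · P(y|x, d_i)
p_seq = float(p_doc @ p_seq_given_doc)
print(round(p_seq, 5))   # → 0.0145
```

Note that a passage with a low retrieval probability contributes little even if the generator scores it well, so retriever quality directly bounds end-to-end accuracy.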
Retrieval Method Comparison (NaturalQuestions Test Set)
| Method | Top-20 Accuracy | Top-100 Accuracy | Notes |
|---|---|---|---|
| BM25 (TF-IDF sparse) | 59.1% | 73.7% | Classic IR; no semantic representation |
| DPR (dense) | 78.4% | 85.4% | Karpukhin et al. (2020); dual-encoder BERT |
| BM25 + DPR hybrid | ~80% | ~87% | Combining sparse + dense improves recall |
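One common way to combine sparse and dense rankings, as in the hybrid row above, is reciprocal rank fusion (RRF); the constant c = 60 follows the usual convention, and the document IDs below are toy values:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], c: int = 60) -> list[str]:
    """Fuse several ranked lists (e.g. BM25 and DPR) by summing 1 / (c + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d7", "d2"]   # sparse lexical ranking (toy)
dpr_ranking  = ["d1", "d2", "d3", "d9"]   # dense semantic ranking (toy)
print(reciprocal_rank_fusion([bm25_ranking, dpr_ranking]))
```

RRF needs only ranks, not comparable scores, which sidesteps the problem that BM25 scores and inner products live on different scales.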
Open-Domain QA Results (Lewis et al., 2020)
| Model | NQ Exact Match | TriviaQA EM | WebQuestions EM |
|---|---|---|---|
| Closed-book T5-11B | 34.5% | 60.5% | 37.4% |
| Closed-book T5-3B | 29.8% | 50.1% | 24.6% |
| RAG-Sequence | 44.5% | 68.0% | 45.5% |
| RAG-Token | 44.5% | 56.8% | 45.5% |
| Fusion-in-Decoder (k=100) | 67.6% | 80.1% | — |
Retrieval Corpus Design
| Factor | Recommendation |
|---|---|
| Document granularity | 100-word passages outperform full documents; better specificity |
| Embedding model | Domain-specific encoders outperform general models |
| Index freshness | Static FAISS index requires re-indexing for knowledge updates |
| Domain coverage | Domain-specific corpora outperform general web for specialized tasks |
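The 100-word passage convention from the table above can be sketched as a fixed-size word-window splitter; `chunk_into_passages` is a hypothetical helper, and real pipelines often add overlap between windows:

```python
def chunk_into_passages(text: str, words_per_passage: int = 100) -> list[str]:
    """Split a document into fixed-size word windows, following the
    DPR convention of ~100-word passages."""
    words = text.split()
    return [
        " ".join(words[i : i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]

doc = ("word " * 250).strip()                 # toy 250-word document
passages = chunk_into_passages(doc)
print([len(p.split()) for p in passages])     # → [100, 100, 50]
```

Smaller passages sharpen retrieval specificity but multiply index size; the 100-word window is the trade-off point the DPR authors settled on.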
RAG vs Parametric Knowledge
RAG trades higher inference-time cost (retrieval + extended context) for:
- Updateable knowledge: new documents can be indexed without retraining
- Verifiable citations: retrieved passages are explicit and inspectable
- Reduced hallucination: generation grounded in specific retrieved text
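The "updateable knowledge" point can be made concrete: adding a document is an index append, not a gradient step. A toy dense index in NumPy (a production system would use FAISS or similar):

```python
import numpy as np

rng = np.random.default_rng(1)
index = rng.normal(size=(1000, 64))        # existing passage embeddings (toy)
query = rng.normal(size=64)

# Knowledge update = appending new embeddings; no generator weights change.
new_passages = rng.normal(size=(10, 64))
index = np.vstack([index, new_passages])

best = int(np.argmax(index @ query))       # retrieval immediately sees the new rows
print(index.shape, best)
```

By contrast, updating parametric knowledge requires fine-tuning or retraining, with the attendant cost and risk of forgetting.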
Related Pages
See hallucination-mechanisms for why retrieved grounding reduces factual errors, context-window for how retrieved passages consume the token budget, and tool-use-function-calling for generalized external knowledge access.
Sources
- Lewis et al. (2020) — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020
- Karpukhin et al. (2020) — Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020
- Izacard & Grave (2021) — Leveraging Passage Retrieval with Generative Models for Open Domain QA. EACL 2021
Frequently Asked Questions
What is the difference between RAG-Sequence and RAG-Token?
In RAG-Sequence, the same retrieved document conditions the entire generated output: P(y|x) ≈ Σ_i p(d_i|x) · P(y|x,d_i). Marginalization over documents happens at the sequence level, so one document effectively explains the whole answer. In RAG-Token, a different document can be marginalized over at each generation step: P(y_t|x,y_{<t}) ≈ Σ_i p(d_i|x) · p(y_t|x,y_{<t},d_i), letting the generator draw different tokens from different passages. RAG-Token is more flexible but more expensive to decode. In Lewis et al. (2020), RAG-Sequence matched or slightly outperformed RAG-Token on the open-domain QA benchmarks above, while RAG-Token was stronger on abstractive generation tasks such as Jeopardy question writing.
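The difference between the two marginalizations can be checked numerically; the probabilities below are illustrative, not values from the paper:

```python
import numpy as np

# k = 2 passages, T = 3 gold tokens. p_tok[i, t] is the generator's probability
# of token y_t given passage d_i; p_doc[i] is p(d_i|x). Toy numbers only.
p_doc = np.array([0.7, 0.3])
p_tok = np.array([[0.9, 0.8, 0.7],    # token probs conditioned on d_1
                  [0.4, 0.5, 0.6]])   # token probs conditioned on d_2

# RAG-Sequence: each passage conditions every step; mix whole-sequence likelihoods.
p_rag_sequence = float(p_doc @ p_tok.prod(axis=1))

# RAG-Token: mix passages at every step, then multiply the per-step mixtures.
p_rag_token = float((p_doc @ p_tok).prod())

print(p_rag_sequence, p_rag_token)    # → 0.3888 0.356775
```

The two quantities coincide only when one passage carries all the probability mass; otherwise RAG-Token can blend evidence across passages token by token.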
Why does RAG help with hallucination?
Closed-book language models must encode all factual knowledge in weights during training; facts absent from or underrepresented in training data cannot be accurately recalled. RAG grounds generation in retrieved documents explicitly provided in context — the model can copy or paraphrase factual information from the retrieved text rather than relying solely on weight-encoded knowledge. However, RAG can still hallucinate if retrieved documents are incorrect or if the model fails to faithfully use retrieved content over its own parametric memory.