Retrieval-Augmented Generation: Grounding Language Models in External Knowledge

Category: agents-applications Updated: 2026-02-27

RAG (Lewis et al., NeurIPS 2020): retriever encodes query q and passages d_i as dense vectors; top-k retrieved by maximum inner product search; RAG-Token achieved 44.5% Exact Match on NaturalQuestions vs 29.8% closed-book T5; DPR retriever top-1 accuracy 78.4% vs BM25 59.1%.

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| RAG-Token NaturalQuestions accuracy | 44.5 | % Exact Match | Lewis et al. (2020): vs 29.8% closed-book T5 baseline; +14.7 points absolute |
| DPR top-1 retrieval accuracy | 78.4 | % top-1 | Karpukhin et al. (2020): Dense Passage Retrieval on NQ test set; vs BM25 59.1% |
| Fusion-in-Decoder top-k | 100 | passages | Izacard & Grave (2021): FiD scales to k=100 retrieved passages; 67.6% EM on NaturalQuestions |
| Context window overhead per passage | ~130 | tokens | 100-word passages ≈ 130 tokens; k=5 adds ~650 tokens to the context budget |
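
The context-budget arithmetic in the last row can be made explicit. A minimal sketch, assuming the ~130 tokens-per-passage figure from the table (actual counts depend on the tokenizer and passage length):

```python
def retrieval_token_overhead(k, tokens_per_passage=130):
    """Estimate context-window tokens consumed by k retrieved passages,
    using the ~130 tokens per 100-word passage figure above."""
    return k * tokens_per_passage

print(retrieval_token_overhead(5))    # k=5, as in the table row
print(retrieval_token_overhead(100))  # the FiD k=100 setting
```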

Retrieval-Augmented Generation (RAG) augments language model generation with dynamically retrieved documents from an external knowledge source. By combining parametric memory (model weights) with non-parametric memory (a retrieval corpus), RAG enables factual generation without requiring all world knowledge to be encoded in parameters during training.

Architecture

RAG consists of two components:

Retriever encodes queries and passages as dense vectors using separate encoders E_q and E_d. At inference time, the top-k passages are retrieved by maximum inner product search (MIPS):

MIPS(q) = top-k_{i=1..N} { E_q(q) · E_d(d_i) }
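
As a toy illustration of MIPS over precomputed passage embeddings (a sketch with made-up vectors; production systems use an approximate-nearest-neighbor index such as FAISS rather than this exhaustive scan):

```python
import numpy as np

def mips_top_k(query_vec, passage_matrix, k):
    """Exhaustive maximum inner product search: score every passage
    embedding against the query embedding, return the k best indices."""
    scores = passage_matrix @ query_vec   # E_q(q) · E_d(d_i) for all i
    top = np.argsort(-scores)[:k]         # indices of the k largest scores
    return top, scores[top]

# Hypothetical corpus: 4 passages embedded in 3 dimensions.
passages = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.7, 0.6, 0.1],
    [0.0, 0.2, 0.9],
])
query = np.array([1.0, 0.5, 0.0])

idx, top_scores = mips_top_k(query, passages, k=2)
print(idx)  # the two highest-scoring passage indices
```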

Generator is a sequence-to-sequence language model conditioned on (query, retrieved passages):

P(y|x) ≈ Σ_{i=1}^{k} p(d_i|x) · P(y|x, d_i)
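
The marginalization can be traced numerically. A minimal sketch with invented retrieval scores and per-document answer likelihoods, assuming p(d_i|x) is a softmax over the inner-product scores, as in Lewis et al.:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical inner-product scores for k=3 retrieved passages, and the
# generator's likelihood of the same answer y under each passage.
retrieval_scores = np.array([2.0, 1.0, 0.5])  # E_q(q) · E_d(d_i)
p_y_given_d = np.array([0.70, 0.20, 0.05])    # P(y | x, d_i)

p_d = softmax(retrieval_scores)               # p(d_i | x)
p_y = float(p_d @ p_y_given_d)                # Σ_i p(d_i|x) · P(y|x, d_i)
print(round(p_y, 3))
```

The answer probability is a retrieval-weighted average: documents the retriever trusts contribute more to the final likelihood.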

Retrieval Method Comparison (NaturalQuestions Test Set)

| Method | Top-1 Accuracy | Top-20 Accuracy | Notes |
|---|---|---|---|
| BM25 (sparse, TF-IDF-based) | 59.1% | 73.7% | Classic lexical IR; no semantic representation |
| DPR (dense) | 78.4% | 85.4% | Karpukhin et al. (2020); dual-encoder BERT |
| BM25 + DPR hybrid | ~80% | ~87% | Combining sparse and dense signals improves recall |
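
One common way to combine sparse and dense rankings is reciprocal rank fusion (RRF). The hybrid row above does not specify which fusion method was used, so this is illustrative only, with hypothetical passage ids:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of passage ids into one list.
    Each list contributes 1/(k + rank + 1) to a passage's fused score,
    so passages ranked well by both retrievers rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d7", "d2"]   # hypothetical sparse top-4
dense_ranking = ["d1", "d2", "d3", "d9"]  # hypothetical dense top-4
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
```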

Open-Domain QA Results (Lewis et al., 2020)

| Model | NQ Exact Match | TriviaQA EM | WebQuestions EM |
|---|---|---|---|
| Closed-book T5-11B | 34.5% | 60.5% | 37.4% |
| Closed-book T5-3B | 29.8% | 50.1% | 24.6% |
| RAG-Sequence | 44.5% | 68.0% | 45.5% |
| RAG-Token | 44.5% | 56.8% | 45.5% |
| Fusion-in-Decoder (k=100) | 67.6% | 80.1% | — |

Retrieval Corpus Design

| Factor | Recommendation |
|---|---|
| Document granularity | 100-word passages outperform full documents; better retrieval specificity |
| Embedding model | Domain-specific encoders outperform general-purpose models |
| Index freshness | A static FAISS index requires re-indexing to incorporate knowledge updates |
| Domain coverage | Domain-specific corpora outperform general web text for specialized tasks |
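
The 100-word granularity above reduces to a simple preprocessing step. A minimal sketch (real pipelines often add overlap between windows and snap to sentence boundaries):

```python
def chunk_into_passages(text, words_per_passage=100):
    """Split a document into fixed-size word windows, matching the
    100-word passage granularity recommended above."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]

doc = "word " * 230  # a 230-word dummy document
passages = chunk_into_passages(doc)
print([len(p.split()) for p in passages])  # → [100, 100, 30]
```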

RAG vs Parametric Knowledge

RAG trades higher inference-time cost (retrieval + extended context) for:

  • Updateable knowledge: new documents can be indexed without retraining
  • Verifiable citations: retrieved passages are explicit and inspectable
  • Reduced hallucination: generation grounded in specific retrieved text

See hallucination-mechanisms for why retrieved grounding reduces factual errors, context-window for how retrieved passages consume the token budget, and tool-use-function-calling for generalized external knowledge access.


Frequently Asked Questions

What is the difference between RAG-Sequence and RAG-Token?

In RAG-Sequence, a single retrieved document conditions the entire generated output: P(y|x) ≈ Σ_i p(d_i|x) · P(y|x,d_i), so each document contributes a likelihood for the whole sequence. In RAG-Token, the marginalization happens at every generation step: P(y_t|x,y_{<t}) ≈ Σ_i p(d_i|x) · P(y_t|x,d_i,y_{<t}), letting different documents dominate different tokens. RAG-Token is more flexible but more expensive. Lewis et al. (2020) found RAG-Sequence slightly ahead on most open-domain QA benchmarks, while RAG-Token was stronger on generation tasks such as Jeopardy question writing.
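
The difference is easiest to see at a single decoding step. A numeric sketch of RAG-Token's per-step marginalization, with invented distributions (k=2 documents, a 3-word vocabulary):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

p_d = softmax(np.array([1.5, 0.5]))  # p(d_i | x), fixed across steps
p_token_given_d = np.array([
    [0.8, 0.1, 0.1],   # p(y_t | x, d_1, y_<t): document 1 favors word 0
    [0.2, 0.6, 0.2],   # p(y_t | x, d_2, y_<t): document 2 favors word 1
])

# RAG-Token: marginalize over documents at this single step; RAG-Sequence
# would instead keep per-document sequence likelihoods separate until the end.
p_token = p_d @ p_token_given_d      # Σ_i p(d_i|x) · p(y_t|x, d_i, y_<t)
print(p_token.round(3))              # a valid distribution over the vocabulary
```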

Why does RAG help with hallucination?

Closed-book language models must encode all factual knowledge in weights during training; facts absent from or underrepresented in training data cannot be accurately recalled. RAG grounds generation in retrieved documents explicitly provided in context — the model can copy or paraphrase factual information from the retrieved text rather than relying solely on weight-encoded knowledge. However, RAG can still hallucinate if retrieved documents are incorrect or if the model fails to faithfully use retrieved content over its own parametric memory.
