Retrieval-Augmented Generation: Grounding Language Models in External Knowledge
RAG (Lewis et al., NeurIPS 2020): retriever encodes query q and passages d_i as dense vectors; top-k retrieved by maximum inner product search; RAG-Token achieved 44.5% Exact Match on NaturalQuestions vs 29.8% closed-book T5; DPR retriever top-20 accuracy 78.4% vs BM25 59.1%.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| RAG-Token NaturalQuestions accuracy | 44.5% | Exact Match | Lewis et al. (2020): vs 29.8% closed-book T5-3B baseline; +14.7 percentage points absolute |
| DPR top-20 retrieval accuracy | 78.4% | % top-20 | Karpukhin et al. (2020): Dense Passage Retrieval on NQ test set; vs BM25 59.1% |
| Fusion-in-Decoder retrieval depth | k = 100 | passages | Izacard & Grave (2021): FiD scales to 100 retrieved passages; 67.6% EM on NaturalQuestions |
| Context window overhead per passage | ~130 tokens per passage | tokens | 100-word passages ≈ 130 tokens; k=5 adds ~650 tokens to context window budget |
Retrieval-Augmented Generation (RAG) augments language model generation with dynamically retrieved documents from an external knowledge source. By combining parametric memory (model weights) with non-parametric memory (a retrieval corpus), RAG enables factual generation without requiring all world knowledge to be encoded in parameters during training.
Architecture
RAG consists of two components:
Retriever encodes queries and passages as dense vectors. At inference, it retrieves the top-k passages by maximum inner product search (MIPS):
MIPS_k(q) = top-k_{i ∈ {1..N}} E_q(q) · E_d(d_i)
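The retrieval step can be sketched with plain NumPy; the index size, embedding dimension, and random vectors below are illustrative, and a production system would use an approximate MIPS library rather than a brute-force scan:

```python
import numpy as np

def mips_top_k(query_vec: np.ndarray, passage_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k passages with the highest inner product with the query."""
    scores = passage_vecs @ query_vec              # (N,) inner products E_q(q) · E_d(d_i)
    # argpartition finds the k largest in O(N) without fully sorting all N scores
    top_k = np.argpartition(scores, -k)[-k:]
    return top_k[np.argsort(scores[top_k])[::-1]]  # order the k winners by descending score

rng = np.random.default_rng(0)
passages = rng.normal(size=(10_000, 768))          # toy index: 10k passages, 768-dim vectors
query = rng.normal(size=768)
print(mips_top_k(query, passages, k=5))
```

Exhaustive dot products are fine at this scale; real RAG deployments replace the scan with an approximate index so retrieval stays sublinear in corpus size.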
Generator is a sequence-to-sequence language model conditioned on the query and each retrieved passage, with the output distribution marginalized over passages:
P(y|x) ≈ Σ_{i=1}^{k} p(d_i|x) · P(y|x, d_i)
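The marginalization above is a weighted mixture of per-passage likelihoods. A toy numeric check, with k = 3 and entirely illustrative probabilities (not values from the paper):

```python
import numpy as np

# p_doc[i] is the retriever posterior p(d_i|x); p_seq_given_doc[i] is the
# generator's sequence likelihood P(y|x, d_i). Numbers are illustrative only.
p_doc = np.array([0.6, 0.3, 0.1])
p_seq_given_doc = np.array([0.020, 0.008, 0.001])

# P(y|x) = sum_i p(d_i|x) · P(y|x, d_i)
p_seq = float(p_doc @ p_seq_given_doc)
print(round(p_seq, 5))   # → 0.0145
```

Note that a passage with a low retrieval probability contributes little even if the generator scores it well, so retriever quality directly bounds end-to-end accuracy.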
Retrieval Method Comparison (NaturalQuestions Test Set)
| Method | Top-20 Accuracy | Top-100 Accuracy | Notes |
|---|---|---|---|
| BM25 (TF-IDF sparse) | 59.1% | 73.7% | Classic IR; no semantic representation |
| DPR (dense) | 78.4% | 85.4% | Karpukhin et al. (2020); dual-encoder BERT |
| BM25 + DPR hybrid | ~80% | ~87% | Combining sparse + dense improves recall |
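One common way to combine sparse and dense rankings, as in the hybrid row above, is reciprocal rank fusion (RRF); the constant c = 60 follows the usual convention, and the document IDs below are toy values:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], c: int = 60) -> list[str]:
    """Fuse several ranked lists (e.g. BM25 and DPR) by summing 1 / (c + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d7", "d2"]   # sparse lexical ranking (toy)
dpr_ranking  = ["d1", "d2", "d3", "d9"]   # dense semantic ranking (toy)
print(reciprocal_rank_fusion([bm25_ranking, dpr_ranking]))
```

RRF needs only ranks, not comparable scores, which sidesteps the problem that BM25 scores and inner products live on different scales.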
Open-Domain QA Results (Lewis et al., 2020)
| Model | NQ Exact Match | TriviaQA EM | WebQuestions EM |
|---|---|---|---|
| Closed-book T5-11B | 34.5% | 60.5% | 37.4% |
| Closed-book T5-3B | 29.8% | 50.1% | 24.6% |
| RAG-Sequence | 44.5% | 68.0% | 45.5% |
| RAG-Token | 44.5% | 56.8% | 45.5% |
| Fusion-in-Decoder (k=100) | 67.6% | 80.1% | — |
Retrieval Corpus Design
| Factor | Recommendation |
|---|---|
| Document granularity | 100-word passages outperform full documents; better specificity |
| Embedding model | Domain-specific encoders outperform general models |
| Index freshness | Static FAISS index requires re-indexing for knowledge updates |
| Domain coverage | Domain-specific corpora outperform general web for specialized tasks |
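The 100-word passage convention from the table above can be sketched as a fixed-size word-window splitter; `chunk_into_passages` is a hypothetical helper, and real pipelines often add overlap between windows:

```python
def chunk_into_passages(text: str, words_per_passage: int = 100) -> list[str]:
    """Split a document into fixed-size word windows, following the
    DPR convention of ~100-word passages."""
    words = text.split()
    return [
        " ".join(words[i : i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]

doc = ("word " * 250).strip()                 # toy 250-word document
passages = chunk_into_passages(doc)
print([len(p.split()) for p in passages])     # → [100, 100, 50]
```

Smaller passages sharpen retrieval specificity but multiply index size; the 100-word window is the trade-off point the DPR authors settled on.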
RAG vs Parametric Knowledge
RAG trades higher inference-time cost (retrieval + extended context) for:
- Updateable knowledge: new documents can be indexed without retraining
- Verifiable citations: retrieved passages are explicit and inspectable
- Reduced hallucination: generation grounded in specific retrieved text
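The "updateable knowledge" point can be made concrete: adding a document is an index append, not a gradient step. A toy dense index in NumPy (a production system would use FAISS or similar):

```python
import numpy as np

rng = np.random.default_rng(1)
index = rng.normal(size=(1000, 64))        # existing passage embeddings (toy)
query = rng.normal(size=64)

# Knowledge update = appending new embeddings; no generator weights change.
new_passages = rng.normal(size=(10, 64))
index = np.vstack([index, new_passages])

best = int(np.argmax(index @ query))       # retrieval immediately sees the new rows
print(index.shape, best)
```

By contrast, updating parametric knowledge requires fine-tuning or retraining, with the attendant cost and risk of forgetting.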
Related Pages
See hallucination-mechanisms for why retrieved grounding reduces factual errors, context-window for how retrieved passages consume the token budget, and tool-use-function-calling for generalized external knowledge access.
Sources
- Lewis et al. (2020) — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020
- Karpukhin et al. (2020) — Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020
- Izacard & Grave (2021) — Leveraging Passage Retrieval with Generative Models for Open Domain QA. EACL 2021
Frequently Asked Questions
What is the difference between RAG-Sequence and RAG-Token?
In RAG-Sequence, the same retrieved document conditions the entire generated output: P(y|x) ≈ Σ_i p(d_i|x) · P(y|x,d_i). Marginalization over documents happens at the sequence level, so one document effectively explains the whole answer. In RAG-Token, a different document can be marginalized over at each generation step: P(y_t|x,y_{<t}) ≈ Σ_i p(d_i|x) · p(y_t|x,y_{<t},d_i), letting the generator draw different tokens from different passages. RAG-Token is more flexible but more expensive to decode. In Lewis et al. (2020), RAG-Sequence matched or slightly outperformed RAG-Token on the open-domain QA benchmarks above, while RAG-Token was stronger on abstractive generation tasks such as Jeopardy question writing.
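The difference between the two marginalizations can be checked numerically; the probabilities below are illustrative, not values from the paper:

```python
import numpy as np

# k = 2 passages, T = 3 gold tokens. p_tok[i, t] is the generator's probability
# of token y_t given passage d_i; p_doc[i] is p(d_i|x). Toy numbers only.
p_doc = np.array([0.7, 0.3])
p_tok = np.array([[0.9, 0.8, 0.7],    # token probs conditioned on d_1
                  [0.4, 0.5, 0.6]])   # token probs conditioned on d_2

# RAG-Sequence: each passage conditions every step; mix whole-sequence likelihoods.
p_rag_sequence = float(p_doc @ p_tok.prod(axis=1))

# RAG-Token: mix passages at every step, then multiply the per-step mixtures.
p_rag_token = float((p_doc @ p_tok).prod())

print(p_rag_sequence, p_rag_token)    # → 0.3888 0.356775
```

The two quantities coincide only when one passage carries all the probability mass; otherwise RAG-Token can blend evidence across passages token by token.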
Why does RAG help with hallucination?
Closed-book language models must encode all factual knowledge in weights during training; facts absent from or underrepresented in training data cannot be accurately recalled. RAG grounds generation in retrieved documents explicitly provided in context — the model can copy or paraphrase factual information from the retrieved text rather than relying solely on weight-encoded knowledge. However, RAG can still hallucinate if retrieved documents are incorrect or if the model fails to faithfully use retrieved content over its own parametric memory.