In-Context Learning: Task Adaptation from Prompt Examples Without Weight Updates

Category: agents-applications Updated: 2026-02-27

Brown et al. (NeurIPS 2020) introduced k-shot ICL with GPT-3: task adaptation from prompt examples alone, without weight updates; 32-shot GPT-3 scores 79.3 on SuperGLUE vs 88.9 for fine-tuned BERT. Min et al. (EMNLP 2022) found that randomly flipping demonstration labels drops accuracy only ~10%, indicating that format and input distribution matter more than correct labels.

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| GPT-3 32-shot SuperGLUE score | 79.3 | points | Brown et al. (2020): GPT-3 175B, 32-shot; fine-tuned BERT-large achieves 88.9 (9.6-point gap) |
| GPT-3 1-shot TriviaQA | 68.0 | % Exact Match | Brown et al. (2020): 0-shot = 64.3%; fine-tuned T5 = 50.1%; ICL surpasses fine-tuned T5 |
| ICL emergent parameter threshold | ~1B | parameters | Brown et al. (2020): meaningful ICL gains appear above ~1B parameters; minimal below |
| Label-flip impact on ICL accuracy | ~10% drop | % accuracy | Min et al. (2022): randomly flipping all demonstration labels drops accuracy only ~10%, not ~50% |

In-context learning (ICL) is the ability of large language models to adapt to new tasks by processing demonstrations in the input prompt — without any gradient updates to model weights. GPT-3 (Brown et al., 2020) demonstrated at scale that a single pre-trained model can perform hundreds of different tasks depending solely on how the prompt is structured.

How In-Context Learning Works

A k-shot ICL prompt provides k labeled examples followed by the test input:

Input: The food was delicious. → Sentiment: Positive
Input: The service was terrible. → Sentiment: Negative
Input: The room was clean. → Sentiment: [MODEL PREDICTS]

The model generates predictions conditioned on all preceding context, using attention over the examples to identify the task structure and expected output format.
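The prompt layout above can be sketched as a small helper. This is a minimal illustration, not any model's API: the `build_kshot_prompt` function and the `Input:`/`Sentiment:` field names are assumptions mirroring the example format.

```python
# Minimal sketch of k-shot prompt construction. The separator and field
# names ("Input", "Sentiment") are illustrative, mirroring the example
# above, not taken from any specific model API.

def build_kshot_prompt(demos, test_input, task_field="Sentiment"):
    """Concatenate k labeled demonstrations, then the unlabeled test input."""
    lines = [f"Input: {x} → {task_field}: {y}" for x, y in demos]
    lines.append(f"Input: {test_input} → {task_field}:")
    return "\n".join(lines)

demos = [
    ("The food was delicious.", "Positive"),
    ("The service was terrible.", "Negative"),
]
prompt = build_kshot_prompt(demos, "The room was clean.")
print(prompt)
```

The resulting string is sent to the model as-is; the model completes the final line, conditioned on the demonstrations through attention.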

GPT-3 ICL Performance (Brown et al., 2020)

| Task | 0-shot | 1-shot | Few-shot | Fine-tuned SOTA |
|---|---|---|---|---|
| TriviaQA | 64.3% | 68.0% | 71.2% | ~75% |
| WebQuestions | 14.4% | 25.3% | 41.5% | 41.7% |
| CoQA (F1) | 81.5% | 84.0% | 85.0% | ~90% |
| SuperGLUE | ~71 | ~75 | 79.3 | 88.9 |

Scaling and ICL Ability

| Model Size | SuperGLUE (few-shot) | ICL Benefit |
|---|---|---|
| 350M | ~52 | Minimal |
| 1.3B | ~58 | Small |
| 6.7B | ~66 | Moderate |
| 13B | ~69 | Clear |
| 175B | 79.3 | Strong |

The Bayesian Interpretation (Xie et al., 2021)

Xie et al. model ICL as implicit Bayesian inference:

  • Pre-trained LM has prior P(concept) over task concepts from training data structure
  • k demonstrations are Bayesian evidence updating this prior: P(concept | demos)
  • Generation conditions on the posterior over task concepts

This explains two key observations:

  1. ICL improves with more examples (more evidence)
  2. Correct labels matter less than format (demonstrations identify the concept, not its mapping)
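A toy numerical version of this view: treat each demonstration as i.i.d. evidence given a latent task concept and update a prior. The three concepts and their per-demo likelihoods below are invented for illustration; this is a sketch of the Bayesian framing, not Xie et al.'s actual model.

```python
# Toy sketch of the Bayesian view of ICL: demonstrations are evidence
# that concentrates a prior over latent task concepts. Concepts and
# likelihood values are made up for illustration.

concepts = ["sentiment", "topic", "grammar"]
prior = {c: 1 / 3 for c in concepts}

# Hypothetical P(demo | concept): the sentiment concept explains the
# demonstrations best.
likelihood = {"sentiment": 0.6, "topic": 0.2, "grammar": 0.1}

def posterior_after_k_demos(k):
    """P(concept | k demos), assuming demos are i.i.d. given the concept."""
    unnorm = {c: prior[c] * likelihood[c] ** k for c in concepts}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

for k in (0, 1, 4, 16):
    post = posterior_after_k_demos(k)
    print(k, round(post["sentiment"], 3))
```

As k grows the posterior mass concentrates on the best-fitting concept, which is observation 1 above in miniature: more demonstrations, sharper task identification.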

What ICL Does and Does Not Learn

| Component of Demonstration | Impact on Accuracy |
|---|---|
| Input-output format | High; model must match output structure |
| Label space (set of possible outputs) | High |
| Input distribution (what examples look like) | Moderate |
| Correct input-label mappings | Low (~10% drop when all flipped) |
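The label-flip probe behind the last row can be sketched as follows: build two prompts from the same demonstrations, one with gold labels and one with labels drawn at random from the label space, then compare model accuracy on each. This is a hedged sketch of the experimental setup in the style of Min et al. (2022), not their code; `make_prompt` and `LABEL_SPACE` are illustrative names.

```python
import random

# Sketch of a label-flip probe: same demonstrations, same format, but one
# prompt replaces gold labels with random draws from the label space.
# Comparing model accuracy on the two isolates how much correct
# input-label mappings contribute beyond format and label space.

LABEL_SPACE = ["Positive", "Negative"]

def make_prompt(demos, test_input, flip=False, seed=0):
    rng = random.Random(seed)
    lines = []
    for x, y in demos:
        label = rng.choice(LABEL_SPACE) if flip else y
        lines.append(f"Input: {x} → Sentiment: {label}")
    lines.append(f"Input: {test_input} → Sentiment:")
    return "\n".join(lines)

demos = [("The food was delicious.", "Positive"),
         ("The service was terrible.", "Negative")]
gold = make_prompt(demos, "The room was clean.")
flipped = make_prompt(demos, "The room was clean.", flip=True)
```

Min et al.'s finding is that a model scored with `flipped` prompts loses only ~10% accuracy relative to `gold`, since format, label space, and input distribution are identical in both.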

See few-shot-learning for k-shot performance benchmarks across tasks, chain-of-thought for reasoning-trace augmentation that dramatically improves ICL on math, and emergent-capabilities for why ICL only emerges at large scale.



Frequently Asked Questions

Does in-context learning actually learn from labeled examples, or does it retrieve task patterns?

Min et al. (2022) found that randomly flipping all demonstration labels reduces accuracy by only ~10% (not ~50% as would be expected if correct labels were essential). This suggests ICL primarily identifies which task format to apply — the input format, output format, label space, and data distribution — rather than learning from individual labeled examples. Xie et al. (2021) formalize this as Bayesian inference: demonstrations are evidence that updates a prior over task concepts encoded during pre-training.

What is the difference between in-context learning and fine-tuning?

Fine-tuning updates model weights via gradient descent on task-specific data, permanently adapting the model. In-context learning freezes all weights; adaptation occurs entirely through the attention mechanism processing the prompt. Fine-tuning typically achieves 10–20 points higher accuracy on benchmark comparisons but requires separate weights per task, training compute, and labeled data. ICL is instant, requires no training, and handles many tasks from a single model, but it is limited by context window length and provides noisier adaptation.
