In-Context Learning: Task Adaptation from Prompt Examples Without Weight Updates

Category: agents-applications Updated: 2026-02-27

Brown et al. (NeurIPS 2020) introduced k-shot ICL with GPT-3: task adaptation from prompt examples alone, without weight updates; 32-shot GPT-3 scores 79.3 on SuperGLUE vs 88.9 for fine-tuned BERT. Min et al. (EMNLP 2022) found that randomly flipping demonstration labels drops accuracy only ~10%, indicating that format and input distribution matter more than correct labels.

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| GPT-3 32-shot SuperGLUE score | 79.3 | points | Brown et al. (2020): GPT-3 175B, 32-shot; fine-tuned BERT-large achieves 88.9 (9.6-point gap) |
| GPT-3 1-shot TriviaQA | 68.0 | % Exact Match | Brown et al. (2020): 0-shot = 64.3%; fine-tuned T5 = 50.1%; ICL surpasses fine-tuned T5 |
| ICL emergent parameter threshold | ~1B | parameters | Brown et al. (2020): meaningful ICL gains appear above ~1B parameters; minimal below |
| Label-flip impact on ICL accuracy | ~10% drop | % accuracy | Min et al. (2022): randomly flipping all demonstration labels drops accuracy only ~10%, not ~50% |

In-context learning (ICL) is the ability of large language models to adapt to new tasks by processing demonstrations in the input prompt — without any gradient updates to model weights. GPT-3 (Brown et al., 2020) demonstrated at scale that a single pre-trained model can perform hundreds of different tasks depending solely on how the prompt is structured.

How In-Context Learning Works

A k-shot ICL prompt provides k labeled examples followed by the test input:

Input: The food was delicious. → Sentiment: Positive
Input: The service was terrible. → Sentiment: Negative
Input: The room was clean. → Sentiment: [MODEL PREDICTS]

The model generates predictions conditioned on all preceding context, using attention over the examples to identify the task structure and expected output format.
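The prompt layout above can be sketched as a small helper. This is a minimal illustration, not any model's API: the `build_kshot_prompt` function and the `Input:`/`Sentiment:` field names are assumptions mirroring the example format.

```python
# Minimal sketch of k-shot prompt construction. The separator and field
# names ("Input", "Sentiment") are illustrative, mirroring the example
# above, not taken from any specific model API.

def build_kshot_prompt(demos, test_input, task_field="Sentiment"):
    """Concatenate k labeled demonstrations, then the unlabeled test input."""
    lines = [f"Input: {x} → {task_field}: {y}" for x, y in demos]
    lines.append(f"Input: {test_input} → {task_field}:")
    return "\n".join(lines)

demos = [
    ("The food was delicious.", "Positive"),
    ("The service was terrible.", "Negative"),
]
prompt = build_kshot_prompt(demos, "The room was clean.")
print(prompt)
```

The resulting string is sent to the model as-is; the model completes the final line, conditioned on the demonstrations through attention.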

GPT-3 ICL Performance (Brown et al., 2020)

| Task | 0-shot | 1-shot | Few-shot | Fine-tuned SOTA |
|---|---|---|---|---|
| TriviaQA | 64.3% | 68.0% | 71.2% | ~75% |
| WebQuestions | 14.4% | 25.3% | 41.5% | 41.7% |
| CoQA (F1) | 81.5% | 84.0% | 85.0% | ~90% |
| SuperGLUE | ~71 | ~75 | 79.3 | 88.9 |

Scaling and ICL Ability

| Model Size | SuperGLUE (few-shot) | ICL Benefit |
|---|---|---|
| 350M | ~52 | Minimal |
| 1.3B | ~58 | Small |
| 6.7B | ~66 | Moderate |
| 13B | ~69 | Clear |
| 175B | 79.3 | Strong |

The Bayesian Interpretation (Xie et al., 2021)

Xie et al. model ICL as implicit Bayesian inference:

  • Pre-trained LM has prior P(concept) over task concepts from training data structure
  • k demonstrations are Bayesian evidence updating this prior: P(concept | demos)
  • Generation conditions on the posterior over task concepts

This explains two key observations:

  1. ICL improves with more examples (more evidence)
  2. Correct labels matter less than format (demonstrations identify the concept, not its mapping)
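A toy numerical version of this view: treat each demonstration as i.i.d. evidence given a latent task concept and update a prior. The three concepts and their per-demo likelihoods below are invented for illustration; this is a sketch of the Bayesian framing, not Xie et al.'s actual model.

```python
# Toy sketch of the Bayesian view of ICL: demonstrations are evidence
# that concentrates a prior over latent task concepts. Concepts and
# likelihood values are made up for illustration.

concepts = ["sentiment", "topic", "grammar"]
prior = {c: 1 / 3 for c in concepts}

# Hypothetical P(demo | concept): the sentiment concept explains the
# demonstrations best.
likelihood = {"sentiment": 0.6, "topic": 0.2, "grammar": 0.1}

def posterior_after_k_demos(k):
    """P(concept | k demos), assuming demos are i.i.d. given the concept."""
    unnorm = {c: prior[c] * likelihood[c] ** k for c in concepts}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

for k in (0, 1, 4, 16):
    post = posterior_after_k_demos(k)
    print(k, round(post["sentiment"], 3))
```

As k grows the posterior mass concentrates on the best-fitting concept, which is observation 1 above in miniature: more demonstrations, sharper task identification.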

What ICL Does and Does Not Learn

| Component of Demonstration | Impact on Accuracy |
|---|---|
| Input-output format | High; model must match output structure |
| Label space (set of possible outputs) | High |
| Input distribution (what examples look like) | Moderate |
| Correct input-label mappings | Low (~10% drop when all flipped) |
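The label-flip probe behind the last row can be sketched as follows: build two prompts from the same demonstrations, one with gold labels and one with labels drawn at random from the label space, then compare model accuracy on each. This is a hedged sketch of the experimental setup in the style of Min et al. (2022), not their code; `make_prompt` and `LABEL_SPACE` are illustrative names.

```python
import random

# Sketch of a label-flip probe: same demonstrations, same format, but one
# prompt replaces gold labels with random draws from the label space.
# Comparing model accuracy on the two isolates how much correct
# input-label mappings contribute beyond format and label space.

LABEL_SPACE = ["Positive", "Negative"]

def make_prompt(demos, test_input, flip=False, seed=0):
    rng = random.Random(seed)
    lines = []
    for x, y in demos:
        label = rng.choice(LABEL_SPACE) if flip else y
        lines.append(f"Input: {x} → Sentiment: {label}")
    lines.append(f"Input: {test_input} → Sentiment:")
    return "\n".join(lines)

demos = [("The food was delicious.", "Positive"),
         ("The service was terrible.", "Negative")]
gold = make_prompt(demos, "The room was clean.")
flipped = make_prompt(demos, "The room was clean.", flip=True)
```

Min et al.'s finding is that a model scored with `flipped` prompts loses only ~10% accuracy relative to `gold`, since format, label space, and input distribution are identical in both.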

See few-shot-learning for k-shot performance benchmarks across tasks, chain-of-thought for reasoning-trace augmentation that dramatically improves ICL on math, and emergent-capabilities for why ICL only emerges at large scale.



Frequently Asked Questions

Does in-context learning actually learn from labeled examples, or does it retrieve task patterns?

Min et al. (2022) found that randomly flipping all demonstration labels reduces accuracy by only ~10% (not ~50% as would be expected if correct labels were essential). This suggests ICL primarily identifies which task format to apply — the input format, output format, label space, and data distribution — rather than learning from individual labeled examples. Xie et al. (2021) formalize this as Bayesian inference: demonstrations are evidence that updates a prior over task concepts encoded during pre-training.

What is the difference between in-context learning and fine-tuning?

Fine-tuning updates model weights via gradient descent on task-specific data, permanently adapting the model. In-context learning freezes all weights; adaptation occurs entirely through the attention mechanism processing the prompt. Fine-tuning typically achieves 10–20 points higher accuracy on benchmark comparisons but requires separate weights per task, training compute, and labeled data. ICL is instant, requires no training, and handles many tasks from a single model, but it is limited by context window length and provides noisier adaptation.
