Few-Shot Learning: Language Model Task Performance from k In-Context Demonstrations
Brown et al. (NeurIPS 2020): GPT-3 175B 32-shot SuperGLUE = 79.3 vs fine-tuned BERT-large 88.9; Zhao et al. (ICML 2021): different orderings of the same k examples produce up to ±15% accuracy variance; calibrating against neutral-input priors reduces order sensitivity.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| GPT-3 32-shot SuperGLUE | 79.3 | points | Brown et al. (2020): 32 examples in context; fine-tuned BERT-large = 88.9; 9.6-point gap |
| Few-shot vs fine-tuning accuracy gap | 10–20% | % accuracy | Consistent gap across NLP benchmarks; fine-tuning remains more accurate for most tasks |
| Example order sensitivity | up to ±15% | % accuracy variance | Zhao et al. (2021): same k examples in different orders produce large accuracy swings on classification |
| Standard k values benchmarked | k = 0, 1, 10, 32 | shots | Brown et al. (2020): 0-shot, 1-shot, and 'few-shot' (context window limit) are standard conditions |
Few-shot learning in language models refers to task performance given only k labeled demonstrations in the input prompt, with no gradient updates. The GPT-3 paper (Brown et al., 2020) established the standard evaluation protocol: benchmark 0-shot, 1-shot, and up to 32-shot (or context-window-limited) performance across diverse tasks.
Standard Benchmark Conditions
| Condition | Prompt Examples | Weight Update | Notes |
|---|---|---|---|
| Zero-shot | 0 | No | Task instruction only |
| One-shot | 1 | No | Single input-output demonstration |
| Few-shot | 2–32 (context-limited) | No | Typically 10–32 in GPT-3 paper |
| Fine-tuned | 0 at inference | Yes | Trained on k examples before deployment |
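The conditions above differ only in how many demonstrations are packed into the prompt. A minimal sketch of assembling such a prompt (the `Input:`/`Output:` template is illustrative; real evaluations vary the format per task):

```python
def build_prompt(instruction, demonstrations, query, k):
    """Assemble a k-shot prompt: instruction, then k input->output
    demonstrations, then the unanswered query. k = 0 yields a
    zero-shot prompt, k = 1 a one-shot prompt, and so on."""
    lines = [instruction]
    for x, y in demonstrations[:k]:
        lines.append(f"Input: {x}\nOutput: {y}")
    # The query is left without an output for the model to complete.
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

# Zero-shot: instruction and query only, no demonstrations used.
zero_shot = build_prompt("Classify the sentiment.",
                         [("great film", "positive")], "dull plot", k=0)

# One-shot: the single demonstration precedes the query.
one_shot = build_prompt("Classify the sentiment.",
                        [("great film", "positive")], "dull plot", k=1)
```

No weights change in any of these conditions; the only variable is the number of demonstrations consumed from the context window.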
GPT-3 Few-Shot vs Fine-Tuning (Brown et al., 2020)
| Benchmark | GPT-3 0-shot | GPT-3 few-shot | Fine-tuned SOTA |
|---|---|---|---|
| SuperGLUE | ~71 | 79.3 | 88.9 (BERT-large) |
| SQuAD v2 (F1) | ~69 | 89.2 | 91.1 |
| TriviaQA | 64.3% | 71.2% | ~75% |
| HellaSwag | 78.9% | 79.3% | 86.5% (ALBERT) |
| NaturalQuestions | 14.6% | 29.9% | ~50% (T5) |
Scaling: Few-Shot Accuracy vs Model Size
| Model Size | SuperGLUE (few-shot) | Incremental Gain |
|---|---|---|
| 1.3B | ~58 | — |
| 6.7B | ~66 | +8 |
| 13B | ~69 | +3 |
| 175B | 79.3 | +10.3 |
The largest gains occur at the extremes: from small to medium scale (capacity for basic task understanding) and from large to very large scale (multi-step compositional reasoning).
Prompt Calibration (Zhao et al., 2021)
Language models exhibit two systematic biases in few-shot classification:
| Bias | Cause | Calibration Fix |
|---|---|---|
| Recency bias | Last example in context gets higher attention weight | Average accuracy over multiple orderings |
| Majority-label bias | Pre-training prior favors common label strings | Divide probabilities by neutral-input priors |
Calibration procedure: compute the model’s predicted probabilities for each label when the input is a content-free string such as “N/A”. Use these as priors: p̃(y|x) = p(y|x) / p(y|“N/A”), then renormalize over the label set. This substantially reduces order sensitivity and improves average accuracy.
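The division-and-renormalize step can be sketched as follows; the probabilities are illustrative numbers, not real model outputs:

```python
def calibrate(label_probs, neutral_probs):
    """Contextual calibration in the spirit of Zhao et al. (2021):
    divide each label's probability under the real input by its
    probability under a content-free input ("N/A"), then renormalize
    so the calibrated scores again sum to 1."""
    scores = {y: p / neutral_probs[y] for y, p in label_probs.items()}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

# A model whose prompt biases it toward "positive" even on neutral input:
p_x = {"positive": 0.7, "negative": 0.3}    # p(y | x)
p_na = {"positive": 0.8, "negative": 0.2}   # p(y | "N/A"), the prior

calibrated = calibrate(p_x, p_na)
# After calibration, "negative" wins: 1.5/2.375 ≈ 0.63 vs 0.875/2.375 ≈ 0.37
```

Here the raw prediction ("positive") flips once the neutral-input prior is divided out, which is exactly the kind of majority-label bias the procedure corrects.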
Few-Shot vs Fine-Tuning: Decision Factors
| Factor | Few-Shot Preferred | Fine-Tuning Preferred |
|---|---|---|
| Labeled data available | <100 examples | >1000 examples |
| Task stability | Transient / experimental | Stable, production use |
| Model count | Single model, many tasks | Separate model per task acceptable |
| Accuracy requirement | Tolerant of 10–20% gap | Gap is critical |
| Inference cost | Extra prompt tokens per query acceptable | Per-query cost must be minimal |
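The decision factors above can be encoded as a rough heuristic; the thresholds (100, 1000) follow the table and are rules of thumb, not hard boundaries:

```python
def choose_adaptation(n_labeled, task_stable, accuracy_critical,
                      single_model_many_tasks):
    """Heuristic sketch of the few-shot vs fine-tuning decision.
    Thresholds and priorities mirror the decision-factors table;
    real deployments should weigh these factors per application."""
    # Fine-tuning: ample data, stable task, and the 10-20% gap matters.
    if accuracy_critical and n_labeled > 1000 and task_stable:
        return "fine-tune"
    # Few-shot: scarce data, transient task, or one model for many tasks.
    if n_labeled < 100 or not task_stable or single_model_many_tasks:
        return "few-shot prompt"
    return "either (pilot few-shot, fine-tune if the gap matters)"
```

For example, a stable production task with 5,000 labels and a hard accuracy requirement lands on fine-tuning, while an experimental task with 50 labels lands on few-shot prompting.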
Related Pages
See in-context-learning for the theoretical account of why few-shot prompting works, prompt-engineering for techniques to reduce order sensitivity, and fine-tuning for when weight updates outperform prompting.
Sources
- Brown et al. (2020) — Language Models are Few-Shot Learners (GPT-3). NeurIPS 2020
- Zhao et al. (2021) — Calibrate Before Use: Improving Few-Shot Performance of Language Models. ICML 2021
Frequently Asked Questions
Why is few-shot performance sensitive to example order?
Zhao et al. (2021) found accuracy swings of up to ±15% from reordering the same k examples. The cause is a recency bias: tokens near the end of the context receive higher attention weight, making the last few examples disproportionately influential. The model is also biased toward label frequencies matching what it saw during pre-training. Calibration — dividing output probabilities by priors computed on neutral inputs — significantly reduces both recency bias and majority-label bias.
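One mitigation mentioned above, averaging accuracy over multiple orderings, can be sketched directly. The `toy_eval` function is a stand-in for actually running the model with a given demonstration order; its numbers are invented purely to mimic a recency-biased model:

```python
from itertools import permutations

def accuracy_over_orderings(demos, eval_fn):
    """Evaluate the same k demonstrations under every ordering and
    report (mean, min, max) accuracy. eval_fn(ordering) stands in for
    running the model with that demo order and scoring the result."""
    accs = [eval_fn(order) for order in permutations(demos)]
    return sum(accs) / len(accs), min(accs), max(accs)

# Toy stand-in: pretend accuracy depends only on which demo comes last,
# mimicking recency bias (these accuracy numbers are hypothetical).
demos = ["A", "B", "C"]
toy_eval = lambda order: {"A": 0.55, "B": 0.70, "C": 0.85}[order[-1]]

mean_acc, worst, best = accuracy_over_orderings(demos, toy_eval)
# The min-max spread (0.55 to 0.85) is the order-sensitivity effect;
# the mean smooths it out, at the cost of k! model runs.
```

Averaging is expensive (k! orderings), which is why the cheaper calibration fix is usually preferred in practice.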
When should few-shot prompting be preferred over fine-tuning?
Few-shot prompting is preferable when: (1) labeled data is very scarce (<100 examples) — insufficient to fine-tune reliably; (2) tasks are transient or low-priority; (3) a single deployed model must handle many different tasks; (4) rapid prototyping without retraining is needed. Fine-tuning is preferable when accuracy is critical, data is available (>1000 examples), the task is stable, and the 10–20% accuracy gap over few-shot matters for the application.