Few-Shot Learning: Language Model Task Performance from k In-Context Demonstrations
Brown et al. (NeurIPS 2020): GPT-3 175B 32-shot SuperGLUE = 79.3 vs fine-tuned BERT-large 88.9; Zhao et al. (ICML 2021): different orderings of the same k examples produce up to ±15% accuracy variance; calibrating against neutral-input priors reduces order sensitivity.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| GPT-3 32-shot SuperGLUE | 79.3 | points | Brown et al. (2020): 32 examples in context; fine-tuned BERT-large = 88.9; 9.6-point gap |
| Few-shot vs fine-tuning accuracy gap | 10–20% | % accuracy | Consistent gap across NLP benchmarks; fine-tuning remains more accurate for most tasks |
| Example order sensitivity | up to ±15% | % accuracy variance | Zhao et al. (2021): same k examples in different orders produce large accuracy swings on classification |
| Standard k values benchmarked | k = 0, 1, 10, 32 | shots | Brown et al. (2020): 0-shot, 1-shot, and 'few-shot' (context window limit) are standard conditions |
Few-shot learning in language models refers to task performance given only k labeled demonstrations in the input prompt, with no gradient updates. The GPT-3 paper (Brown et al., 2020) established the standard evaluation protocol: benchmark 0-shot, 1-shot, and up to 32-shot (or context-window-limited) performance across diverse tasks.
Standard Benchmark Conditions
| Condition | Prompt Examples | Weight Update | Notes |
|---|---|---|---|
| Zero-shot | 0 | No | Task instruction only |
| One-shot | 1 | No | Single input-output demonstration |
| Few-shot | 2–32 (context-limited) | No | Typically 10–32 in GPT-3 paper |
| Fine-tuned | 0 at inference | Yes | Trained on k examples before deployment |
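The conditions above differ only in how many demonstrations are packed into the prompt. A minimal sketch of assembling such a prompt (the `Input:`/`Output:` template is illustrative; real evaluations vary the format per task):

```python
def build_prompt(instruction, demonstrations, query, k):
    """Assemble a k-shot prompt: instruction, then k input->output
    demonstrations, then the unanswered query. k = 0 yields a
    zero-shot prompt, k = 1 a one-shot prompt, and so on."""
    lines = [instruction]
    for x, y in demonstrations[:k]:
        lines.append(f"Input: {x}\nOutput: {y}")
    # The query is left without an output for the model to complete.
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

# Zero-shot: instruction and query only, no demonstrations used.
zero_shot = build_prompt("Classify the sentiment.",
                         [("great film", "positive")], "dull plot", k=0)

# One-shot: the single demonstration precedes the query.
one_shot = build_prompt("Classify the sentiment.",
                        [("great film", "positive")], "dull plot", k=1)
```

No weights change in any of these conditions; the only variable is the number of demonstrations consumed from the context window.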
GPT-3 Few-Shot vs Fine-Tuning (Brown et al., 2020)
| Benchmark | GPT-3 0-shot | GPT-3 few-shot | Fine-tuned SOTA |
|---|---|---|---|
| SuperGLUE | ~71 | 79.3 | 88.9 (BERT-large) |
| SQuAD v2 (F1) | ~69 | 89.2 | 91.1 |
| TriviaQA | 64.3% | 71.2% | ~75% |
| HellaSwag | 78.9% | 79.3% | 86.5% (ALBERT) |
| NaturalQuestions | 14.6% | 29.9% | ~50% (T5) |
Scaling: Few-Shot Accuracy vs Model Size
| Model Size | SuperGLUE (few-shot) | Incremental Gain |
|---|---|---|
| 1.3B | ~58 | — |
| 6.7B | ~66 | +8 |
| 13B | ~69 | +3 |
| 175B | 79.3 | +10.3 |
The largest gains occur at the extremes: from small to medium scale (capacity for basic task understanding) and from large to very large scale (multi-step compositional reasoning).
Prompt Calibration (Zhao et al., 2021)
Language models exhibit two systematic biases in few-shot classification:
| Bias | Cause | Calibration Fix |
|---|---|---|
| Recency bias | Last example in context gets higher attention weight | Average accuracy over multiple orderings |
| Majority-label bias | Pre-training prior favors common label strings | Divide probabilities by neutral-input priors |
Calibration procedure: compute the model’s predicted probabilities for each label when the input is a content-free string such as “N/A”. Use these as priors: p̃(y|x) = p(y|x) / p(y|“N/A”), then renormalize over the label set. This substantially reduces order sensitivity and improves average accuracy.
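The division-and-renormalize step can be sketched as follows; the probabilities are illustrative numbers, not real model outputs:

```python
def calibrate(label_probs, neutral_probs):
    """Contextual calibration in the spirit of Zhao et al. (2021):
    divide each label's probability under the real input by its
    probability under a content-free input ("N/A"), then renormalize
    so the calibrated scores again sum to 1."""
    scores = {y: p / neutral_probs[y] for y, p in label_probs.items()}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

# A model whose prompt biases it toward "positive" even on neutral input:
p_x = {"positive": 0.7, "negative": 0.3}    # p(y | x)
p_na = {"positive": 0.8, "negative": 0.2}   # p(y | "N/A"), the prior

calibrated = calibrate(p_x, p_na)
# After calibration, "negative" wins: 1.5/2.375 ≈ 0.63 vs 0.875/2.375 ≈ 0.37
```

Here the raw prediction ("positive") flips once the neutral-input prior is divided out, which is exactly the kind of majority-label bias the procedure corrects.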
Few-Shot vs Fine-Tuning: Decision Factors
| Factor | Few-Shot Preferred | Fine-Tuning Preferred |
|---|---|---|
| Labeled data available | <100 examples | >1000 examples |
| Task stability | Transient / experimental | Stable, production use |
| Model count | Single model, many tasks | Separate model per task acceptable |
| Accuracy requirement | Tolerant of 10–20% gap | Gap is critical |
| Inference cost | Extra prompt tokens per query acceptable | Per-query cost must be minimal |
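The decision factors above can be encoded as a rough heuristic; the thresholds (100, 1000) follow the table and are rules of thumb, not hard boundaries:

```python
def choose_adaptation(n_labeled, task_stable, accuracy_critical,
                      single_model_many_tasks):
    """Heuristic sketch of the few-shot vs fine-tuning decision.
    Thresholds and priorities mirror the decision-factors table;
    real deployments should weigh these factors per application."""
    # Fine-tuning: ample data, stable task, and the 10-20% gap matters.
    if accuracy_critical and n_labeled > 1000 and task_stable:
        return "fine-tune"
    # Few-shot: scarce data, transient task, or one model for many tasks.
    if n_labeled < 100 or not task_stable or single_model_many_tasks:
        return "few-shot prompt"
    return "either (pilot few-shot, fine-tune if the gap matters)"
```

For example, a stable production task with 5,000 labels and a hard accuracy requirement lands on fine-tuning, while an experimental task with 50 labels lands on few-shot prompting.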
Related Pages
See in-context-learning for the theoretical account of why few-shot prompting works, prompt-engineering for techniques to reduce order sensitivity, and fine-tuning for when weight updates outperform prompting.
Sources
- Brown et al. (2020) — Language Models are Few-Shot Learners (GPT-3). NeurIPS 2020
- Zhao et al. (2021) — Calibrate Before Use: Improving Few-Shot Performance of Language Models. ICML 2021
Frequently Asked Questions
Why is few-shot performance sensitive to example order?
Zhao et al. (2021) found accuracy swings of up to ±15% from reordering the same k examples. The cause is a recency bias: tokens near the end of the context receive higher attention weight, making the last few examples disproportionately influential. The model is also biased toward label frequencies matching what it saw during pre-training. Calibration — dividing output probabilities by priors computed on neutral inputs — significantly reduces both recency bias and majority-label bias.
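One mitigation mentioned above, averaging accuracy over multiple orderings, can be sketched directly. The `toy_eval` function is a stand-in for actually running the model with a given demonstration order; its numbers are invented purely to mimic a recency-biased model:

```python
from itertools import permutations

def accuracy_over_orderings(demos, eval_fn):
    """Evaluate the same k demonstrations under every ordering and
    report (mean, min, max) accuracy. eval_fn(ordering) stands in for
    running the model with that demo order and scoring the result."""
    accs = [eval_fn(order) for order in permutations(demos)]
    return sum(accs) / len(accs), min(accs), max(accs)

# Toy stand-in: pretend accuracy depends only on which demo comes last,
# mimicking recency bias (these accuracy numbers are hypothetical).
demos = ["A", "B", "C"]
toy_eval = lambda order: {"A": 0.55, "B": 0.70, "C": 0.85}[order[-1]]

mean_acc, worst, best = accuracy_over_orderings(demos, toy_eval)
# The min-max spread (0.55 to 0.85) is the order-sensitivity effect;
# the mean smooths it out, at the cost of k! model runs.
```

Averaging is expensive (k! orderings), which is why the cheaper calibration fix is usually preferred in practice.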
When should few-shot prompting be preferred over fine-tuning?
Few-shot prompting is preferable when: (1) labeled data is very scarce (<100 examples) — insufficient to fine-tune reliably; (2) tasks are transient or low-priority; (3) a single deployed model must handle many different tasks; (4) rapid prototyping without retraining is needed. Fine-tuning is preferable when accuracy is critical, data is available (>1000 examples), the task is stable, and the 10–20% accuracy gap over few-shot matters for the application.