Emergent Capabilities: Abilities That Appear Above Scale Thresholds in Language Models
Wei et al. (TMLR 2022) documented 137 tasks across 8 model families that show near-zero performance below a scale threshold and sharp improvement above it; 3-digit arithmetic emerges at roughly 8–13B parameters. Schaeffer et al. (NeurIPS 2023) counter that switching to continuous metrics largely eliminates the apparent discontinuities, suggesting a measurement artifact.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Emergent tasks documented | 137 | distinct tasks | Wei et al. (2022): sourced from BIG-Bench, MMLU, and other benchmarks across 8 model families |
| 3-digit arithmetic emergence threshold | ~8–13B | parameters | Wei et al. (2022): near-zero accuracy below 8B parameters; above 10B, accuracy exceeds 90% |
| BIG-Bench tasks showing emergence | ~26% | % of BIG-Bench tasks | Srivastava et al. (2022): ~74% improve gradually; ~26% show discontinuous emergence pattern |
| Schaeffer et al. metric experiment | Discontinuities largely disappear | — | Switching from exact match to continuous metrics (log-probability of the correct token) reveals smooth scaling |
Emergent capabilities are abilities of large language models that appear absent at smaller scales and arise sharply above some parameter or compute threshold. Wei et al. (2022) systematically documented this phenomenon across 137 tasks and 8 model families, finding that many tasks show near-zero performance below a threshold and qualitatively better performance above it.
Documented Emergence Examples (Wei et al., 2022)
| Task | Approximate Threshold | Behavior at Threshold |
|---|---|---|
| 3-digit addition | ~8–13B parameters | ~5% → >90% accuracy |
| Multi-step arithmetic | ~10B parameters | Near-zero → strong performance |
| Word unscrambling | ~200M–2B parameters | Emerges early; varies by word length |
| Analogical reasoning | ~10B parameters | Geometric and semantic analogies |
| Multi-step QA (BIG-Bench) | ~50–100B parameters | Compositional reasoning tasks |
| Few-shot chain-of-thought | ~100B parameters | CoT benefits require extreme scale |
The Measurement Artifact Hypothesis (Schaeffer et al., 2023)
Schaeffer et al. tested whether emergence is real by changing evaluation metrics:
| Metric Type | Arithmetic Task Result |
|---|---|
| Exact match (binary) | Sharp emergence visible at ~10B |
| Number of correct digits (continuous) | Smooth, gradual improvement |
| Log-probability of correct token | Smooth scaling; no discontinuity |
On arithmetic tasks, switching from a binary “fully correct” metric to a continuous “number of correct digits” metric reveals smooth improvement with scale. The apparent phase transition is imposed by the binary metric rather than observed in the underlying model behavior.
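This mechanism can be sketched numerically: if per-token accuracy improves smoothly with scale, an exact-match metric over a multi-token answer (which requires every token to be correct) still appears to jump sharply. A minimal Python sketch, assuming a hypothetical power-law per-token accuracy curve and a 6-token answer length; neither is fitted to any real model family:

```python
def token_accuracy(n_params):
    # Hypothetical smooth per-token accuracy improving as a power law
    # in parameter count (illustrative assumption, not real data).
    return 1.0 - 0.5 * (n_params / 1e8) ** -0.3

ANSWER_TOKENS = 6  # exact match requires all 6 tokens to be correct

for n_params in (1e8, 1e9, 1e10, 1e11):
    p = token_accuracy(n_params)
    exact_match = p ** ANSWER_TOKENS  # binary metric: all-or-nothing
    print(f"{n_params:.0e} params: per-token={p:.3f}  exact-match={exact_match:.3f}")
```

The continuous metric (per-token accuracy) rises gradually from 0.50 to about 0.94 across three orders of magnitude, while exact match sits below 2% at the smallest scale and only becomes substantial at the largest scales, mimicking an emergence curve even though nothing discontinuous happened underneath.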
Emergence Across Model Families
| Model Family | Parameter Range Tested | Emergent Tasks Found |
|---|---|---|
| GPT-3 | 1B–175B | 30+ tasks |
| LaMDA | 422M–137B | Multiple arithmetic/reasoning |
| PaLM | 8B–540B | 25+ tasks |
| Chinchilla | 70B | Selected compositional tasks |
BIG-Bench: Scale of the Emergence Phenomenon
Of the 204 BIG-Bench tasks analyzed by Srivastava et al. (2022):
- ~74%: smooth, gradual improvement with scale
- ~26%: emergent pattern (near-zero then sharp improvement)
The emergent subset is biased toward multi-step compositional tasks — those requiring chains of reasoning operations. Simple classification, pattern matching, and retrieval tasks tend to scale smoothly.
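One way to operationalize the smooth-vs-emergent distinction is a simple curve classifier: flag a task as emergent if its scores sit near a random-chance floor at all but the largest scale and then jump. A toy sketch; the `floor` and `jump` thresholds are illustrative assumptions, not the criteria used by Srivastava et al.:

```python
def is_emergent(scores, floor=0.05, jump=0.30):
    """Classify a scaling curve (scores ordered by increasing model size).

    A curve counts as 'emergent' if every point except the last stays at
    or below `floor` and the final point exceeds the floor by `jump`.
    Thresholds here are illustrative, not from the BIG-Bench analysis.
    """
    *early, last = scores
    return all(s <= floor for s in early) and (last - floor) >= jump

smooth = [0.10, 0.25, 0.45, 0.70]  # gradual improvement with scale
sharp = [0.01, 0.02, 0.03, 0.62]   # near-zero then sudden jump
print(is_emergent(smooth))  # → False
print(is_emergent(sharp))   # → True
```

In practice any such rule is sensitive to where the model sizes were sampled, which is part of Schaeffer et al.'s point: sparse sampling near a threshold can make a steep-but-continuous curve look discontinuous.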
Emergence and Capabilities in Practice
| Capability | Scale Threshold | Notes |
|---|---|---|
| In-context learning | ~1B | Weak below; strong above |
| Multi-step arithmetic | ~10B | Specific to exact-match metric |
| Chain-of-thought reasoning | ~100B | See chain-of-thought |
| Complex multi-hop QA | ~50–100B | Depends on task formulation |
Related Pages
See scaling-laws for the smooth power-law framework that predicts continuous loss improvement, in-context-learning for emergence of the few-shot adaptation ability, and chain-of-thought for reasoning capabilities requiring extreme scale.
Sources
- Wei et al. (2022) — Emergent Abilities of Large Language Models. TMLR 2022
- Srivastava et al. (2022) — Beyond the Imitation Game: Quantifying and Extrapolating LLM Capabilities (BIG-Bench). TMLR 2023
- Schaeffer et al. (2023) — Are Emergent Abilities of Large Language Models a Mirage? NeurIPS 2023
Frequently Asked Questions
Are emergent abilities real phase transitions or measurement artifacts?
Schaeffer et al. (2023) argue that most apparent emergence is a metric artifact. Tasks evaluated with discontinuous metrics (exact match, pass/fail) show sharp thresholds because the metric changes discontinuously even when the underlying model probability improves smoothly. Switching the same tasks to continuous metrics (log-probability of the correct answer) largely eliminates the discontinuities, revealing smooth scaling. This suggests emergence reflects the choice of evaluation metric more than a genuine capability phase transition.
What is the relationship between emergent capabilities and in-context learning?
In-context learning is itself an emergent capability: Brown et al. (2020) found meaningful few-shot ICL gains appear sharply above ~1B parameters, with small models showing no benefit. Many of the 137 emergent tasks in Wei et al. (2022) are few-shot tasks requiring the model to both understand a task specification and generalize from a small number of examples. This compositional generalization — applying learned task-solving strategies to new task specifications — appears to be the core capability that emerges with scale.