Emergent Capabilities: Abilities That Appear Above Scale Thresholds in Language Models

Category: evaluation · Updated: 2026-02-27

Wei et al. (TMLR 2022) documented 137 tasks across 8 model families that show near-zero performance followed by sharp improvement above a scale threshold; 3-digit arithmetic emerges at roughly 8–13B parameters. Schaeffer et al. (NeurIPS 2023) found that switching to continuous metrics largely eliminates the apparent discontinuities, suggesting a measurement artifact.

Key Data Points
| Measure | Value | Unit | Notes |
| --- | --- | --- | --- |
| Emergent tasks documented | 137 | distinct tasks | Wei et al. (2022): sourced from BIG-Bench, MMLU, and other benchmarks across 8 model families |
| 3-digit arithmetic emergence threshold | ~8–13B | parameters | Wei et al. (2022): near-zero accuracy below 8B parameters; above 10B, accuracy exceeds 90% |
| BIG-Bench tasks showing emergence | ~26% | % of BIG-Bench tasks | Srivastava et al. (2022): ~74% improve gradually; ~26% show a discontinuous emergence pattern |
| Schaeffer et al. metric experiment | Discontinuities largely disappear | — | Switching from exact match to continuous metrics (log-prob of correct token) reveals smooth scaling |

Emergent capabilities are abilities of large language models that appear absent at smaller scales and arise sharply above some parameter or compute threshold. Wei et al. (2022) systematically documented this phenomenon across 137 tasks and 8 model families, finding that many tasks show near-zero performance below a threshold and qualitatively better performance above it.

Documented Emergence Examples (Wei et al., 2022)

| Task | Approximate Threshold | Behavior at Threshold |
| --- | --- | --- |
| 3-digit addition | ~8–13B parameters | ~5% → >90% accuracy |
| Multi-step arithmetic | ~10B parameters | Near-zero → strong performance |
| Word unscrambling | ~200M–2B parameters | Emerges early; varies by word length |
| Analogical reasoning | ~10B parameters | Geometric and semantic analogies |
| Multi-step QA (BIG-Bench) | ~50–100B parameters | Compositional reasoning tasks |
| Few-shot chain-of-thought | ~100B parameters | CoT benefits require extreme scale |

The Measurement Artifact Hypothesis (Schaeffer et al., 2023)

Schaeffer et al. tested whether emergence is real by changing evaluation metrics:

| Metric Type | Arithmetic Task Result |
| --- | --- |
| Exact match (binary) | Sharp emergence visible at ~10B |
| Number of correct digits (continuous) | Smooth, gradual improvement |
| Log-probability of correct token | Smooth scaling; no discontinuity |

On arithmetic tasks, switching from “fully correct” (binary) to “number of correct digits” (continuous) shows smooth improvement with scale. The apparent phase transition is imposed by the binary metric rather than observed in the underlying model behavior.
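A minimal simulation illustrates the argument (all numbers are hypothetical, not fitted to any real model): assume per-digit accuracy improves smoothly with parameter count. The continuous metric (expected fraction of correct digits) then scales smoothly, while exact match, which requires every digit to be correct, appears to jump.

```python
import math

def per_digit_accuracy(params_b: float) -> float:
    """Hypothetical smooth per-digit accuracy as a function of model
    size in billions of parameters (illustrative functional form)."""
    return 1.0 - 0.9 * math.exp(-0.25 * params_b)

DIGITS = 4  # a 3-digit addition can produce a 4-digit answer

for params_b in [0.5, 2, 8, 13, 30, 100]:
    p = per_digit_accuracy(params_b)
    continuous = p             # expected fraction of correct digits
    exact_match = p ** DIGITS  # all digits must be correct simultaneously
    print(f"{params_b:6.1f}B  continuous={continuous:.3f}  exact_match={exact_match:.3f}")
```

The continuous column rises steadily across the whole range, while the exact-match column stays near zero for small models and then climbs rapidly, reproducing the apparent "phase transition" from a perfectly smooth underlying curve.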

Emergence Across Model Families

| Model Family | Parameter Range Tested | Emergent Tasks Found |
| --- | --- | --- |
| GPT-3 | 1B–175B | 30+ tasks |
| LaMDA | 422M–137B | Multiple arithmetic/reasoning tasks |
| PaLM | 8B–540B | 25+ tasks |
| Chinchilla | 70B | Selected compositional tasks |

BIG-Bench: Scale of the Emergence Phenomenon

Of ~204 BIG-Bench tasks analyzed by Srivastava et al. (2022):

  • ~74%: smooth, gradual improvement with scale
  • ~26%: emergent pattern (near-zero then sharp improvement)

The emergent subset is biased toward multi-step compositional tasks — those requiring chains of reasoning operations. Simple classification, pattern matching, and retrieval tasks tend to scale smoothly.
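This bias has a simple probabilistic reading (a sketch under an independence assumption, not a result from the cited papers): if each reasoning step succeeds with a smoothly improving probability p, a k-step task is scored correct only when every step succeeds, so its success rate is roughly p^k, which stays near zero until p is already high.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that an entire k-step chain is correct,
    assuming steps succeed independently (a simplification)."""
    return per_step_accuracy ** steps

# Per-step accuracy improving smoothly from 50% to 95%:
for p in [0.50, 0.70, 0.85, 0.95]:
    row = "  ".join(f"k={k}: {chain_success(p, k):.3f}" for k in (1, 4, 10))
    print(f"p={p:.2f}  {row}")
```

Single-step (k=1) performance tracks p smoothly, while the 10-step column remains negligible until p is well above 0.85, which is consistent with long compositional chains looking emergent even when each component improves gradually.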

Emergence and Capabilities in Practice

| Capability | Scale Threshold | Notes |
| --- | --- | --- |
| In-context learning | ~1B parameters | Weak below; strong above |
| Multi-step arithmetic | ~10B parameters | Specific to exact-match metric |
| Chain-of-thought reasoning | ~100B parameters | See chain-of-thought |
| Complex multi-hop QA | ~50–100B parameters | Depends on task formulation |

See scaling-laws for the smooth power-law framework that predicts continuous loss improvement, in-context-learning for emergence of the few-shot adaptation ability, and chain-of-thought for reasoning capabilities requiring extreme scale.


Frequently Asked Questions

Are emergent abilities real phase transitions or measurement artifacts?

Schaeffer et al. (2023) argue that most apparent emergence is a metric artifact. Tasks evaluated with discontinuous metrics (exact match, pass/fail) show sharp thresholds because the metric changes discontinuously even when the underlying model probability improves smoothly. Switching the same tasks to continuous metrics (log-probability of the correct answer) largely eliminates the discontinuities, revealing smooth scaling. This suggests emergence reflects the choice of evaluation metric more than a genuine capability phase transition.

What is the relationship between emergent capabilities and in-context learning?

In-context learning is itself an emergent capability: Brown et al. (2020) found meaningful few-shot ICL gains appear sharply above ~1B parameters, with small models showing no benefit. Many of the 137 emergent tasks in Wei et al. (2022) are few-shot tasks requiring the model to both understand a task specification and generalize from a small number of examples. This compositional generalization — applying learned task-solving strategies to new task specifications — appears to be the core capability that emerges with scale.
