Emergent Capabilities: Abilities That Appear Above Scale Thresholds in Language Models
Wei et al. (TMLR 2022) documented 137 tasks across 8 model families that show near-zero performance below a scale threshold and sharp improvement above it; 3-digit arithmetic emerges at roughly 8–13B parameters. Schaeffer et al. (NeurIPS 2023) counter that switching to continuous metrics largely eliminates the apparent discontinuities, suggesting a measurement artifact.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Emergent tasks documented | 137 | distinct tasks | Wei et al. (2022): sourced from BIG-Bench, MMLU, and other benchmarks across 8 model families |
| 3-digit arithmetic emergence threshold | ~8–13B | parameters | Wei et al. (2022): near-zero accuracy below 8B parameters; above 10B, accuracy exceeds 90% |
| BIG-Bench tasks showing emergence | ~26% | % of BIG-Bench tasks | Srivastava et al. (2022): ~74% improve gradually; ~26% show discontinuous emergence pattern |
| Schaeffer et al. metric experiment | Discontinuities largely disappear | — | Switching from exact match to continuous metrics (log-probability of the correct token) reveals smooth scaling |
Emergent capabilities are abilities of large language models that appear absent at smaller scales and arise sharply above some parameter or compute threshold. Wei et al. (2022) systematically documented this phenomenon across 137 tasks and 8 model families, finding that many tasks show near-zero performance below a threshold and qualitatively better performance above it.
Documented Emergence Examples (Wei et al., 2022)
| Task | Approximate Threshold | Behavior at Threshold |
|---|---|---|
| 3-digit addition | ~8–13B parameters | ~5% → >90% accuracy |
| Multi-step arithmetic | ~10B parameters | Near-zero → strong performance |
| Word unscrambling | ~200M–2B parameters | Emerges early; varies by word length |
| Analogical reasoning | ~10B parameters | Geometric and semantic analogies |
| Multi-step QA (BIG-Bench) | ~50–100B parameters | Compositional reasoning tasks |
| Few-shot chain-of-thought | ~100B parameters | CoT benefits require extreme scale |
The Measurement Artifact Hypothesis (Schaeffer et al., 2023)
Schaeffer et al. tested whether emergence is real by changing evaluation metrics:
| Metric Type | Arithmetic Task Result |
|---|---|
| Exact match (binary) | Sharp emergence visible at ~10B |
| Number of correct digits (continuous) | Smooth, gradual improvement |
| Log-probability of correct token | Smooth scaling; no discontinuity |
On arithmetic tasks, switching from a binary “fully correct” metric to a continuous “number of correct digits” metric reveals smooth improvement with scale. The apparent phase transition is imposed by the binary metric rather than observed in the underlying model behavior.
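This mechanism can be sketched numerically: if per-token accuracy improves smoothly with scale, an exact-match metric over a multi-token answer (which requires every token to be correct) still appears to jump sharply. A minimal Python sketch, assuming a hypothetical power-law per-token accuracy curve and a 6-token answer length; neither is fitted to any real model family:

```python
def token_accuracy(n_params):
    # Hypothetical smooth per-token accuracy improving as a power law
    # in parameter count (illustrative assumption, not real data).
    return 1.0 - 0.5 * (n_params / 1e8) ** -0.3

ANSWER_TOKENS = 6  # exact match requires all 6 tokens to be correct

for n_params in (1e8, 1e9, 1e10, 1e11):
    p = token_accuracy(n_params)
    exact_match = p ** ANSWER_TOKENS  # binary metric: all-or-nothing
    print(f"{n_params:.0e} params: per-token={p:.3f}  exact-match={exact_match:.3f}")
```

The continuous metric (per-token accuracy) rises gradually from 0.50 to about 0.94 across three orders of magnitude, while exact match sits below 2% at the smallest scale and only becomes substantial at the largest scales, mimicking an emergence curve even though nothing discontinuous happened underneath.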
Emergence Across Model Families
| Model Family | Parameter Range Tested | Emergent Tasks Found |
|---|---|---|
| GPT-3 | 1B–175B | 30+ tasks |
| LaMDA | 422M–137B | Multiple arithmetic/reasoning |
| PaLM | 8B–540B | 25+ tasks |
| Chinchilla | 70B | Selected compositional tasks |
BIG-Bench: Scale of the Emergence Phenomenon
Of the 204 BIG-Bench tasks analyzed by Srivastava et al. (2022):
- ~74%: smooth, gradual improvement with scale
- ~26%: emergent pattern (near-zero then sharp improvement)
The emergent subset is biased toward multi-step compositional tasks — those requiring chains of reasoning operations. Simple classification, pattern matching, and retrieval tasks tend to scale smoothly.
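One way to operationalize the smooth-vs-emergent distinction is a simple curve classifier: flag a task as emergent if its scores sit near a random-chance floor at all but the largest scale and then jump. A toy sketch; the `floor` and `jump` thresholds are illustrative assumptions, not the criteria used by Srivastava et al.:

```python
def is_emergent(scores, floor=0.05, jump=0.30):
    """Classify a scaling curve (scores ordered by increasing model size).

    A curve counts as 'emergent' if every point except the last stays at
    or below `floor` and the final point exceeds the floor by `jump`.
    Thresholds here are illustrative, not from the BIG-Bench analysis.
    """
    *early, last = scores
    return all(s <= floor for s in early) and (last - floor) >= jump

smooth = [0.10, 0.25, 0.45, 0.70]  # gradual improvement with scale
sharp = [0.01, 0.02, 0.03, 0.62]   # near-zero then sudden jump
print(is_emergent(smooth))  # → False
print(is_emergent(sharp))   # → True
```

In practice any such rule is sensitive to where the model sizes were sampled, which is part of Schaeffer et al.'s point: sparse sampling near a threshold can make a steep-but-continuous curve look discontinuous.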
Emergence and Capabilities in Practice
| Capability | Scale Threshold | Notes |
|---|---|---|
| In-context learning | ~1B | Weak below; strong above |
| Multi-step arithmetic | ~10B | Specific to exact-match metric |
| Chain-of-thought reasoning | ~100B | See chain-of-thought |
| Complex multi-hop QA | ~50–100B | Depends on task formulation |
Related Pages
See scaling-laws for the smooth power-law framework that predicts continuous loss improvement, in-context-learning for emergence of the few-shot adaptation ability, and chain-of-thought for reasoning capabilities requiring extreme scale.
Sources
- Wei et al. (2022) — Emergent Abilities of Large Language Models. TMLR 2022
- Srivastava et al. (2022) — Beyond the Imitation Game: Quantifying and Extrapolating LLM Capabilities (BIG-Bench). TMLR 2023
- Schaeffer et al. (2023) — Are Emergent Abilities of Large Language Models a Mirage? NeurIPS 2023
Frequently Asked Questions
Are emergent abilities real phase transitions or measurement artifacts?
Schaeffer et al. (2023) argue that most apparent emergence is a metric artifact. Tasks evaluated with discontinuous metrics (exact match, pass/fail) show sharp thresholds because the metric changes discontinuously even when the underlying model probability improves smoothly. Switching the same tasks to continuous metrics (log-probability of the correct answer) largely eliminates the discontinuities, revealing smooth scaling. This suggests emergence reflects the choice of evaluation metric more than a genuine capability phase transition.
What is the relationship between emergent capabilities and in-context learning?
In-context learning is itself an emergent capability: Brown et al. (2020) found meaningful few-shot ICL gains appear sharply above ~1B parameters, with small models showing no benefit. Many of the 137 emergent tasks in Wei et al. (2022) are few-shot tasks requiring the model to both understand a task specification and generalize from a small number of examples. This compositional generalization — applying learned task-solving strategies to new task specifications — appears to be the core capability that emerges with scale.