Scaling Laws: How Language Model Performance Scales with Parameters, Data, and Compute

Category: training · Updated: 2026-02-27

Kaplan et al. (2020) found L ∝ N^{-0.076} and L ∝ D^{-0.095}; Chinchilla (2022) revised: optimal N and D both scale as C^{0.5}, so a 70B model should train on ~1.4T tokens to be compute-optimal.

Key Data Points
| Measure | Value | Unit | Notes |
| --- | --- | --- | --- |
| Kaplan loss vs parameters | L(N) = (N_c/N)^{α_N}, α_N = 0.076 | | N_c ≈ 8.8×10¹³; power law with exponent 0.076 across ~6 orders of magnitude |
| Kaplan loss vs dataset size | L(D) = (D_c/D)^{α_D}, α_D = 0.095 | | D_c ≈ 5.4×10¹³ tokens; loss decreases predictably as the dataset grows |
| Chinchilla optimal parameter scaling | N_opt ∝ C^{0.5} | | Hoffmann et al. (2022): N and D should scale equally with compute budget C |
| Chinchilla 70B optimal tokens | 1.4 trillion | tokens | Compute-optimal for a 70B-parameter model; D_opt ≈ 20 tokens per parameter |
| Pre-Chinchilla models | ~10–20× undertrained | | GPT-3 (175B on 300B tokens) needed ~3.5T tokens to be compute-optimal; Gopher (280B on 300B) needed ~5.6T |

Scaling laws describe how language model performance (measured by cross-entropy loss on held-out text) changes predictably as a function of model size (N), training dataset size (D), and total compute (C). These empirical relationships, holding across many orders of magnitude, are foundational to decisions about model architecture, training budgets, and data collection.

Kaplan et al. (2020) Power Laws

Fitting ~170 language models ranging from 768 to 1.5 billion non-embedding parameters, Kaplan et al. found:

L(N) ≈ (N_c / N)^{α_N} where α_N ≈ 0.076, N_c ≈ 8.8 × 10¹³

L(D) ≈ (D_c / D)^{α_D} where α_D ≈ 0.095, D_c ≈ 5.4 × 10¹³

L(C) ≈ (C_c / C)^{α_C} where α_C ≈ 0.050

These power laws hold over 6+ orders of magnitude in each variable.
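As a minimal sketch, the two fits above can be evaluated directly (constants taken from the formulas in this section; treat the outputs as rough predictions of held-out cross-entropy, not exact values for any particular model):

```python
# Kaplan et al. (2020) power-law fits, using the constants quoted above.
ALPHA_N, N_C = 0.076, 8.8e13   # loss vs non-embedding parameter count
ALPHA_D, D_C = 0.095, 5.4e13   # loss vs dataset size in tokens

def loss_vs_params(n_params: float) -> float:
    """L(N) = (N_c / N)^alpha_N, with data and compute not bottlenecked."""
    return (N_C / n_params) ** ALPHA_N

def loss_vs_data(n_tokens: float) -> float:
    """L(D) = (D_c / D)^alpha_D, with model size not bottlenecked."""
    return (D_C / n_tokens) ** ALPHA_D

# A power law means fixed *multiplicative* gains: doubling N multiplies
# loss by 2**(-0.076), i.e. roughly a 5% reduction, at any scale.
ratio = loss_vs_params(2e9) / loss_vs_params(1e9)
print(f"loss ratio from doubling N: {ratio:.4f}")  # ~0.949 at any N
```

The scale-free ratio is the practical content of the law: the same relative improvement costs a doubling of N whether you start from 10M or 10B parameters.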

Compute-Optimal Training: Kaplan vs Chinchilla

| Source | Recommendation | Implication |
| --- | --- | --- |
| Kaplan et al. (2020) | N scales faster than D with compute | Train large models on relatively few tokens |
| Chinchilla (2022) | N_opt ∝ C^{0.5}, D_opt ∝ C^{0.5} | Scale N and D equally |

The practical implication of Chinchilla, for a fixed compute budget C (in FLOPs):

  • Compute-optimal N ≈ 0.1 × √C (follows from C ≈ 6ND combined with D ≈ 20N)
  • Compute-optimal D ≈ 20 × N training tokens
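A back-of-envelope version of this allocation rule, assuming the common C ≈ 6ND approximation for training FLOPs (the paper's fitted coefficients differ slightly, so its tabulated optima do not match this rule exactly):

```python
import math

def chinchilla_split(compute_flops: float) -> tuple[float, float]:
    """Split a FLOP budget into (params, tokens) via the rules of thumb:
    C ≈ 6*N*D and D ≈ 20*N  =>  C ≈ 120*N**2  =>  N ≈ sqrt(C/120)."""
    n_opt = math.sqrt(compute_flops / 120.0)  # ≈ 0.091 * sqrt(C)
    d_opt = 20.0 * n_opt
    return n_opt, d_opt

# Chinchilla's own budget, ~5.76e23 FLOPs, recovers ~70B params / ~1.4T tokens.
n, d = chinchilla_split(5.76e23)
print(f"N_opt ≈ {n:.3g} params, D_opt ≈ {d:.3g} tokens")
```

Note that both outputs scale as √C, so a 100× bigger budget buys a 10× bigger model and a 10× bigger dataset.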

Chinchilla Optimal Configurations (Hoffmann et al., Table A9)

| Compute (FLOPs) | Optimal N | Optimal D (tokens) |
| --- | --- | --- |
| 10¹⁸ | 400M | 7.7B |
| 10¹⁹ | 1B | 22B |
| 10²⁰ | 4.6B | 86B |
| 10²¹ | 22B | 400B |
| 10²³ | 470B | 9.3T |

Key Implication: Most Large Models Were Undertrained

| Model | Parameters | Training Tokens | Chinchilla-Optimal Tokens |
| --- | --- | --- | --- |
| GPT-3 | 175B | 300B | ~3.5T |
| Gopher | 280B | 300B | ~5.6T |
| Chinchilla | 70B | 1.4T | 1.4T (optimal) |

Chinchilla (70B, 1.4T) outperformed Gopher (280B, 300B) on most benchmarks despite having 4× fewer parameters: the two models used roughly the same training compute, but Chinchilla spent it on tokens rather than parameters. This demonstrated that prior large models were significantly undertrained relative to their parameter count (and, as a bonus, a smaller model is cheaper at inference time).
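The "same compute" claim can be sanity-checked with the standard C ≈ 6ND approximation for training FLOPs (an approximation, not the papers' exact accounting), using the parameter and token counts from the table above:

```python
def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute via the common rule C ≈ 6*N*D."""
    return 6.0 * params * tokens

gopher     = train_flops(280e9, 300e9)   # ~5.0e23 FLOPs
chinchilla = train_flops(70e9, 1.4e12)   # ~5.9e23 FLOPs, nearly the same
gpt3       = train_flops(175e9, 300e9)   # ~3.2e23 FLOPs

print(f"Chinchilla / Gopher compute ratio: {chinchilla / gopher:.2f}")
```

Gopher and Chinchilla land within ~20% of each other; the difference between them is purely how the budget is split between N and D.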

See chinchilla-scaling for deeper analysis of the Chinchilla paper, compute-flops for how FLOPs are counted in practice, and emergent-capabilities for how scale-dependent abilities relate to these laws.


Frequently Asked Questions

What are neural scaling laws and why do they matter?

Neural scaling laws are empirical relationships showing that language model loss (measured by cross-entropy on held-out text) decreases predictably as a power function of model size, dataset size, or compute. Kaplan et al. (2020) demonstrated these relationships hold across 7 orders of magnitude in compute. This predictability enables researchers to forecast model capabilities before training, allocate compute budgets optimally, and design experiments efficiently.

Why did Chinchilla overturn the Kaplan scaling law conclusions?

Kaplan et al. (2020) found that for fixed compute, increasing model size helped more than increasing dataset size, leading to the practice of training large models on relatively few tokens. Hoffmann et al. (2022) showed this was incorrect, largely due to insufficient hyperparameter tuning (notably learning-rate schedules) of the smaller models. With properly tuned baselines, they found parameters and tokens should scale equally with compute. Their Chinchilla model (70B params, 1.4T tokens) outperformed the 4×-larger Gopher (280B params, 300B tokens) while using roughly the same training compute.

What is the 'emergent abilities' threshold implied by scaling laws?

Scaling laws model the smooth, continuous decrease in loss as models grow. However, Wei et al. (2022) documented 'emergent abilities' — capabilities like multi-step arithmetic, analogical reasoning, and certain NLP tasks that appear suddenly at specific scale thresholds rather than improving continuously. The apparent disconnect is partly a measurement artifact: binary evaluation metrics (pass/fail) can show discontinuous jumps even when the underlying loss is improving smoothly.
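A toy illustration of that measurement artifact (hypothetical numbers, not from Wei et al.): suppose per-token accuracy improves smoothly with scale, but the task is scored by exact match over a k-token answer. The all-or-nothing metric then rises sharply even though the underlying quantity improves gradually:

```python
K = 30  # hypothetical answer length in tokens

def exact_match(p_token: float, k: int = K) -> float:
    """All-or-nothing score: probability that all k tokens are correct."""
    return p_token ** k

# Smooth per-token gains -> apparently "emergent" jump in the task metric.
for p in (0.80, 0.90, 0.95, 0.99):
    print(f"per-token accuracy {p:.2f} -> exact match {exact_match(p):.3f}")
```

Going from 0.80 to 0.99 per-token accuracy moves exact match from well under 1% to above 70%, so on this metric the capability appears to "switch on" near the top of the range.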
