Scaling Laws: How Language Model Performance Scales with Parameters, Data, and Compute
Kaplan et al. (2020) found L ∝ N^{-0.076} and L ∝ D^{-0.095}; Chinchilla (2022) revised: optimal N and D both scale as C^{0.5}, so a 70B model should train on ~1.4T tokens to be compute-optimal.
| Measure | Value | Notes |
|---|---|---|
| Kaplan loss vs parameters | L(N) = (N_c/N)^{α_N}, α_N ≈ 0.076 | N_c ≈ 8.8×10¹³; power law holds across ~6 orders of magnitude in N |
| Kaplan loss vs dataset size | L(D) = (D_c/D)^{α_D}, α_D ≈ 0.095 | D_c ≈ 5.4×10¹³ tokens; loss falls predictably as the dataset grows |
| Chinchilla optimal parameter scaling | N_opt ∝ C^{0.5} | Hoffmann et al. (2022): N and D should scale equally with compute budget C |
| Chinchilla 70B optimal tokens | ~1.4 trillion tokens | Compute-optimal at ≈20 training tokens per parameter (D_opt ≈ 20 × N) |
| Pre-Chinchilla models | ~10-20× undertrained | GPT-3 trained 175B parameters on 300B tokens; Chinchilla prescribes ~3.5T |
Scaling laws describe how language model performance (measured by cross-entropy loss on held-out text) changes predictably as a function of model size (N), training dataset size (D), and total compute (C). These empirical relationships, holding across many orders of magnitude, are foundational to decisions about model architecture, training budgets, and data collection.
Kaplan et al. (2020) Power Laws
Training ~170 language models ranging from 768 to 1.5 billion non-embedding parameters, Kaplan et al. found:
L(N) ≈ (N_c / N)^{α_N} where α_N ≈ 0.076, N_c ≈ 8.8 × 10¹³
L(D) ≈ (D_c / D)^{α_D} where α_D ≈ 0.095, D_c ≈ 5.4 × 10¹³
L(C) ≈ (C_c / C)^{α_C} where α_C ≈ 0.050
These power laws hold over 6+ orders of magnitude in each variable.
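As a quick illustration (a sketch using the fitted constants quoted above, not code from the paper), the power laws can be evaluated directly; plugging in GPT-3-scale values gives a predicted loss of roughly 1.6 nats per token:

```python
# Sketch: evaluate the Kaplan et al. (2020) power laws L(N) and L(D).
# Constants are the fitted values quoted above; loss is in nats of
# cross-entropy per token.

N_C, ALPHA_N = 8.8e13, 0.076   # parameter-scaling constants
D_C, ALPHA_D = 5.4e13, 0.095   # data-scaling constants (tokens)

def loss_from_params(n_params: float) -> float:
    """Predicted loss when model size is the binding constraint."""
    return (N_C / n_params) ** ALPHA_N

def loss_from_data(n_tokens: float) -> float:
    """Predicted loss when dataset size is the binding constraint."""
    return (D_C / n_tokens) ** ALPHA_D

print(loss_from_params(175e9))  # ~1.60 for a 175B-parameter model
print(loss_from_data(300e9))    # ~1.64 for a 300B-token dataset
```

The small exponents are the point: each tenfold increase in N shaves off only a ~16% factor of loss, which is why these trends stay visible across so many orders of magnitude.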
Compute-Optimal Training: Kaplan vs Chinchilla
| Source | Recommendation | Implication |
|---|---|---|
| Kaplan et al. (2020) | N scales faster than D with compute | Train large models on limited tokens |
| Chinchilla (2022) | N_opt ∝ C^{0.5}, D_opt ∝ C^{0.5} | Equal scaling of N and D |
The practical implication of Chinchilla, for a fixed compute budget C (in FLOPs):
- Compute-optimal N ≈ √(C/120) ≈ 0.09 × √C parameters (combining C ≈ 6ND with D ≈ 20N)
- Compute-optimal D ≈ 20 × N training tokens
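These two rules pin down the whole allocation. A minimal sketch, assuming the standard C ≈ 6ND FLOP approximation:

```python
import math

def chinchilla_allocation(compute_flops: float) -> tuple[float, float]:
    """Split a FLOP budget C into compute-optimal (params, tokens).

    Combines C ≈ 6*N*D with the Chinchilla rule D ≈ 20*N,
    giving C ≈ 120*N**2, i.e. N ≈ sqrt(C / 120).
    """
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Chinchilla's own budget (~5.8e23 FLOPs) recovers ~70B params, ~1.4T tokens.
n, d = chinchilla_allocation(5.8e23)
print(f"{n:.3g} params, {d:.3g} tokens")
```

Running this at Chinchilla's own compute budget reproduces the 70B-parameter, 1.4T-token configuration the paper trained.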
Chinchilla Optimal Configurations (Hoffmann et al., 2022)
| Compute (FLOPs) | Optimal N | Optimal D (tokens) |
|---|---|---|
| ~1.9×10¹⁹ | 400M | 7.7B |
| ~1.3×10²⁰ | 1B | 22B |
| ~2.4×10²¹ | 4.6B | 86B |
| ~5.3×10²² | 22B | 400B |
| ~2.6×10²⁵ | 470B | 9.3T |
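A quick sanity check on these configurations (a sketch, with the (N, D) pairs transcribed from the table above): every row sits close to the ≈20 tokens-per-parameter rule.

```python
# (N params, D tokens) pairs from the configuration table above.
configs = [
    (400e6, 7.7e9),
    (1e9, 22e9),
    (4.6e9, 86e9),
    (22e9, 400e9),
    (470e9, 9.3e12),
]

for n, d in configs:
    # Tokens-per-parameter ratio; the Chinchilla-optimal value is roughly 20.
    print(f"N = {n:.2g}: D/N = {d / n:.1f}")
```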
Key Implication: Most Large Models Were Undertrained
| Model | Parameters | Training Tokens | Chinchilla-Optimal Tokens |
|---|---|---|---|
| GPT-3 | 175B | 300B | ~3.5T |
| Gopher | 280B | 300B | ~5.6T |
| Chinchilla | 70B | 1.4T | 1.4T (optimal) |
Chinchilla (70B, 1.4T tokens) outperformed Gopher (280B, 300B tokens) on most benchmarks despite using roughly the same training compute budget (~5×10²³ FLOPs) with 4× fewer parameters, demonstrating that prior large models were significantly undertrained relative to their parameter count.
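Under the standard C ≈ 6ND approximation, the two training runs can be compared directly:

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    # Standard approximation: ~6 FLOPs per parameter per training token.
    return 6 * n_params * n_tokens

gopher = train_flops(280e9, 300e9)      # ~5.0e23 FLOPs
chinchilla = train_flops(70e9, 1.4e12)  # ~5.9e23 FLOPs
print(chinchilla / gopher)              # ~1.17: roughly the same budget
```

The two budgets differ by under 20%, well within the slack of the 6ND approximation, so the benchmark gap is attributable to allocation (more tokens, fewer parameters) rather than to extra compute.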
Related Pages
See chinchilla-scaling for deeper analysis of the Chinchilla paper, compute-flops for how FLOPs are counted in practice, and emergent-capabilities for how scale-dependent abilities relate to these laws.
Sources
- Kaplan et al. (2020) — Scaling Laws for Neural Language Models. arXiv
- Hoffmann et al. (2022) — Training Compute-Optimal Large Language Models (Chinchilla). NeurIPS 2022
- Henighan et al. (2020) — Scaling Laws for Autoregressive Generative Modeling. arXiv
Frequently Asked Questions
What are neural scaling laws and why do they matter?
Neural scaling laws are empirical relationships showing that language model loss (measured by cross-entropy on held-out text) decreases predictably as a power function of model size, dataset size, or compute. Kaplan et al. (2020) demonstrated these relationships hold across 7 orders of magnitude in compute. This predictability enables researchers to forecast model capabilities before training, allocate compute budgets optimally, and design experiments efficiently.
Why did Chinchilla overturn the Kaplan scaling law conclusions?
Kaplan et al. (2020) found that for a fixed compute budget, increasing model size helped more than increasing dataset size, leading to the practice of training very large models on relatively few tokens. Hoffmann et al. (2022) showed this conclusion stemmed partly from suboptimal hyperparameter choices for the smaller baseline models, notably learning-rate schedules not matched to training length. With properly tuned baselines, they found parameters and tokens should scale equally with compute. Their Chinchilla model (70B params, 1.4T tokens) outperformed both GPT-3 (175B, 300B tokens) and Gopher (280B, 300B tokens), matching Gopher's compute budget with 4× fewer parameters.
What is the 'emergent abilities' threshold implied by scaling laws?
Scaling laws model the smooth, continuous decrease in loss as models grow. However, Wei et al. (2022) documented 'emergent abilities' — capabilities like multi-step arithmetic, analogical reasoning, and certain NLP tasks that appear suddenly at specific scale thresholds rather than improving continuously. The apparent disconnect is partly a measurement artifact: binary evaluation metrics (pass/fail) can show discontinuous jumps even when the underlying loss is improving smoothly.