Scaling Laws: How Language Model Performance Scales with Parameters, Data, and Compute
Kaplan et al. (2020) found L ∝ N^{-0.076} and L ∝ D^{-0.095}; Chinchilla (2022) revised: optimal N and D both scale as C^{0.5}, so a 70B model should train on ~1.4T tokens to be compute-optimal.
| Measure | Value | Notes |
|---|---|---|
| Kaplan loss vs parameters | L(N) = (N_c/N)^{α_N}, α_N ≈ 0.076 | N_c ≈ 8.8×10¹³; power law holds across ~6 orders of magnitude in N |
| Kaplan loss vs dataset size | L(D) = (D_c/D)^{α_D}, α_D ≈ 0.095 | D_c ≈ 5.4×10¹³ tokens; loss falls predictably as the dataset grows |
| Chinchilla optimal parameter scaling | N_opt ∝ C^{0.5} | Hoffmann et al. (2022): N and D should scale equally with compute budget C |
| Chinchilla 70B optimal tokens | ~1.4 trillion tokens | Compute-optimal at ≈20 training tokens per parameter (D_opt ≈ 20 × N) |
| Pre-Chinchilla models | ~10-20× undertrained | GPT-3 trained 175B parameters on 300B tokens; Chinchilla prescribes ~3.5T |
Scaling laws describe how language model performance (measured by cross-entropy loss on held-out text) changes predictably as a function of model size (N), training dataset size (D), and total compute (C). These empirical relationships, holding across many orders of magnitude, are foundational to decisions about model architecture, training budgets, and data collection.
Kaplan et al. (2020) Power Laws
Training ~170 language models ranging from 768 to 1.5 billion non-embedding parameters, Kaplan et al. found:
L(N) ≈ (N_c / N)^{α_N} where α_N ≈ 0.076, N_c ≈ 8.8 × 10¹³
L(D) ≈ (D_c / D)^{α_D} where α_D ≈ 0.095, D_c ≈ 5.4 × 10¹³
L(C) ≈ (C_c / C)^{α_C} where α_C ≈ 0.050
These power laws hold over 6+ orders of magnitude in each variable.
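As a quick illustration (a sketch using the fitted constants quoted above, not code from the paper), the power laws can be evaluated directly; plugging in GPT-3-scale values gives a predicted loss of roughly 1.6 nats per token:

```python
# Sketch: evaluate the Kaplan et al. (2020) power laws L(N) and L(D).
# Constants are the fitted values quoted above; loss is in nats of
# cross-entropy per token.

N_C, ALPHA_N = 8.8e13, 0.076   # parameter-scaling constants
D_C, ALPHA_D = 5.4e13, 0.095   # data-scaling constants (tokens)

def loss_from_params(n_params: float) -> float:
    """Predicted loss when model size is the binding constraint."""
    return (N_C / n_params) ** ALPHA_N

def loss_from_data(n_tokens: float) -> float:
    """Predicted loss when dataset size is the binding constraint."""
    return (D_C / n_tokens) ** ALPHA_D

print(loss_from_params(175e9))  # ~1.60 for a 175B-parameter model
print(loss_from_data(300e9))    # ~1.64 for a 300B-token dataset
```

The small exponents are the point: each tenfold increase in N shaves off only a ~16% factor of loss, which is why these trends stay visible across so many orders of magnitude.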
Compute-Optimal Training: Kaplan vs Chinchilla
| Source | Recommendation | Implication |
|---|---|---|
| Kaplan et al. (2020) | N scales faster than D with compute | Train large models on limited tokens |
| Chinchilla (2022) | N_opt ∝ C^{0.5}, D_opt ∝ C^{0.5} | Equal scaling of N and D |
The practical implication of Chinchilla, for a fixed compute budget C (in FLOPs):
- Compute-optimal N ≈ √(C/120) ≈ 0.09 × √C parameters (combining C ≈ 6ND with D ≈ 20N)
- Compute-optimal D ≈ 20 × N training tokens
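These two rules pin down the whole allocation. A minimal sketch, assuming the standard C ≈ 6ND FLOP approximation:

```python
import math

def chinchilla_allocation(compute_flops: float) -> tuple[float, float]:
    """Split a FLOP budget C into compute-optimal (params, tokens).

    Combines C ≈ 6*N*D with the Chinchilla rule D ≈ 20*N,
    giving C ≈ 120*N**2, i.e. N ≈ sqrt(C / 120).
    """
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Chinchilla's own budget (~5.8e23 FLOPs) recovers ~70B params, ~1.4T tokens.
n, d = chinchilla_allocation(5.8e23)
print(f"{n:.3g} params, {d:.3g} tokens")
```

Running this at Chinchilla's own compute budget reproduces the 70B-parameter, 1.4T-token configuration the paper trained.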
Chinchilla Optimal Configurations (Hoffmann et al., 2022)
| Compute (FLOPs) | Optimal N | Optimal D (tokens) |
|---|---|---|
| ~1.9×10¹⁹ | 400M | 7.7B |
| ~1.3×10²⁰ | 1B | 22B |
| ~2.4×10²¹ | 4.6B | 86B |
| ~5.3×10²² | 22B | 400B |
| ~2.6×10²⁵ | 470B | 9.3T |
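A quick sanity check on these configurations (a sketch, with the (N, D) pairs transcribed from the table above): every row sits close to the ≈20 tokens-per-parameter rule.

```python
# (N params, D tokens) pairs from the configuration table above.
configs = [
    (400e6, 7.7e9),
    (1e9, 22e9),
    (4.6e9, 86e9),
    (22e9, 400e9),
    (470e9, 9.3e12),
]

for n, d in configs:
    # Tokens-per-parameter ratio; the Chinchilla-optimal value is roughly 20.
    print(f"N = {n:.2g}: D/N = {d / n:.1f}")
```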
Key Implication: Most Large Models Were Undertrained
| Model | Parameters | Training Tokens | Chinchilla-Optimal Tokens |
|---|---|---|---|
| GPT-3 | 175B | 300B | ~3.5T |
| Gopher | 280B | 300B | ~5.6T |
| Chinchilla | 70B | 1.4T | 1.4T (optimal) |
Chinchilla (70B, 1.4T tokens) outperformed Gopher (280B, 300B tokens) on most benchmarks despite using roughly the same training compute budget (~5×10²³ FLOPs) with 4× fewer parameters, demonstrating that prior large models were significantly undertrained relative to their parameter count.
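Under the standard C ≈ 6ND approximation, the two training runs can be compared directly:

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    # Standard approximation: ~6 FLOPs per parameter per training token.
    return 6 * n_params * n_tokens

gopher = train_flops(280e9, 300e9)      # ~5.0e23 FLOPs
chinchilla = train_flops(70e9, 1.4e12)  # ~5.9e23 FLOPs
print(chinchilla / gopher)              # ~1.17: roughly the same budget
```

The two budgets differ by under 20%, well within the slack of the 6ND approximation, so the benchmark gap is attributable to allocation (more tokens, fewer parameters) rather than to extra compute.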
Related Pages
See chinchilla-scaling for deeper analysis of the Chinchilla paper, compute-flops for how FLOPs are counted in practice, and emergent-capabilities for how scale-dependent abilities relate to these laws.
Sources
- Kaplan et al. (2020) — Scaling Laws for Neural Language Models. arXiv
- Hoffmann et al. (2022) — Training Compute-Optimal Large Language Models (Chinchilla). NeurIPS 2022
- Henighan et al. (2020) — Scaling Laws for Autoregressive Generative Modeling. arXiv
Frequently Asked Questions
What are neural scaling laws and why do they matter?
Neural scaling laws are empirical relationships showing that language model loss (measured by cross-entropy on held-out text) decreases predictably as a power function of model size, dataset size, or compute. Kaplan et al. (2020) demonstrated these relationships hold across 7 orders of magnitude in compute. This predictability enables researchers to forecast model capabilities before training, allocate compute budgets optimally, and design experiments efficiently.
Why did Chinchilla overturn the Kaplan scaling law conclusions?
Kaplan et al. (2020) found that for a fixed compute budget, increasing model size helped more than increasing dataset size, leading to the practice of training very large models on relatively few tokens. Hoffmann et al. (2022) showed this conclusion stemmed partly from suboptimal hyperparameter choices for the smaller baseline models, notably learning-rate schedules not matched to training length. With properly tuned baselines, they found parameters and tokens should scale equally with compute. Their Chinchilla model (70B params, 1.4T tokens) outperformed both GPT-3 (175B, 300B tokens) and Gopher (280B, 300B tokens), matching Gopher's compute budget with 4× fewer parameters.
What is the 'emergent abilities' threshold implied by scaling laws?
Scaling laws model the smooth, continuous decrease in loss as models grow. However, Wei et al. (2022) documented 'emergent abilities' — capabilities like multi-step arithmetic, analogical reasoning, and certain NLP tasks that appear suddenly at specific scale thresholds rather than improving continuously. The apparent disconnect is partly a measurement artifact: binary evaluation metrics (pass/fail) can show discontinuous jumps even when the underlying loss is improving smoothly.