Chinchilla Scaling: Compute-Optimal Training and the 20-Token-Per-Parameter Rule
Chinchilla scaling (Hoffmann et al., 2022): compute-optimal training uses ~20 tokens per parameter; Chinchilla-70B (1.4T tokens) outperforms Gopher-280B (300B tokens) with the same training compute and 4× fewer parameters, showing prior large models were severely undertrained.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Compute-optimal tokens per parameter | ~20 | tokens/parameter | Derived from 400+ training runs; N_opt and D_opt both scale as C^{0.5} |
| Chinchilla model size | 70 | billion parameters | Compute-optimal for 5.76×10²³ FLOPs given 1.4T training tokens |
| Chinchilla training tokens | 1.4 | trillion tokens | ~20× N = 20 × 70B = 1.4T tokens; compute-optimal point |
| Chinchilla vs Gopher (280B, 300B tokens) | Chinchilla wins on most benchmarks | — | Same training compute but 4× fewer parameters; smaller but properly trained |
| Approach 1 (fixed compute, vary N) | N ∝ C^{0.49}, D ∝ C^{0.51} | scaling exponents | First estimation method from Hoffmann et al.; consistent across all three methods |
The Chinchilla paper (Hoffmann et al., 2022) established that the large language models preceding it were significantly undertrained — they had too many parameters relative to their training token count for the compute budget spent. This finding reshaped scaling strategy and introduced the “20 tokens per parameter” rule for compute-optimal training.
The Central Finding
Hoffmann et al. trained over 400 language models ranging from 70M to 16B parameters, varying the number of training tokens for each model size, all within controlled compute budgets. They found:
For compute budget C, the compute-optimal model has:
- N_opt ≈ (C / (6 × 20))^{0.5} parameters
- D_opt ≈ 20 × N_opt training tokens
The factor 6 is the total training FLOPs per parameter per token: ~2 for the forward pass plus ~4 for the backward pass. Thus C ≈ 6 × N × D; substituting D = 20 × N gives C ≈ 120 × N², and hence N_opt ≈ (C/120)^{0.5}.
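Under the approximation C ≈ 6 × N × D total training FLOPs with D = 20 × N, the compute-optimal point reduces to a one-line calculation. A minimal sketch (the function name is our own):

```python
def compute_optimal(c_flops: float, tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Return (N_opt, D_opt) for a training budget of c_flops FLOPs.

    Uses C ~= 6 * N * D total training FLOPs and D = tokens_per_param * N,
    so C ~= 120 * N^2 and N_opt = sqrt(C / 120).
    """
    n_opt = (c_flops / (6.0 * tokens_per_param)) ** 0.5
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt

# Gopher/Chinchilla budget of 5.76e23 FLOPs recovers roughly 70B params, 1.4T tokens
n, d = compute_optimal(5.76e23)
print(f"N_opt ~ {n / 1e9:.0f}B params, D_opt ~ {d / 1e12:.2f}T tokens")
# → N_opt ~ 69B params, D_opt ~ 1.39T tokens
```

That the simple 120 × N² rule lands almost exactly on Chinchilla's published 70B / 1.4T configuration is a useful sanity check on the approximation.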
Comparison: Kaplan vs Chinchilla Recommendations
| Recommendation | Model size N | Training tokens D | Compute C |
|---|---|---|---|
| Kaplan optimal (2020) | Grows fast with compute (N ∝ C^{0.73}) | Grows slowly (D ∝ C^{0.27}) | Fixed budget |
| Chinchilla optimal (2022) | N ∝ C^{0.5} | D ∝ C^{0.5}, i.e. ~20×N | Fixed budget |
Model Benchmarks: Chinchilla vs Contemporaries
| Model | Parameters | Tokens | Compute | Avg Benchmark |
|---|---|---|---|---|
| Gopher | 280B | 300B | 5.76×10²³ | 75.1% |
| GPT-3 | 175B | 300B | ~3.1×10²³ | 73.9% |
| Megatron-Turing | 530B | 270B | 9.9×10²³ | 74.5% |
| Chinchilla | 70B | 1.4T | 5.76×10²³ | 77.7% |
Chinchilla achieves the best performance using the same compute as Gopher but with 4× fewer parameters and 4.7× more training tokens.
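The compute column can be roughly cross-checked with the standard C ≈ 6 × N × D approximation. Note this slightly undercounts the paper's reported 5.76×10²³ for Gopher, which comes from a more exact FLOP count:

```python
# (parameters, training tokens) from the benchmark table above
models = {
    "Gopher":          (280e9, 300e9),
    "GPT-3":           (175e9, 300e9),
    "Megatron-Turing": (530e9, 270e9),
    "Chinchilla":      (70e9, 1.4e12),
}

for name, (n_params, n_tokens) in models.items():
    c = 6 * n_params * n_tokens  # approximate total training FLOPs
    print(f"{name:16s} ~ {c:.2e} FLOPs")

# Gopher ~ 5.04e23 and Chinchilla ~ 5.88e23: comparable training budgets,
# but Chinchilla spends them on 4x fewer parameters and ~4.7x more tokens.
```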
Compute-Optimal Points Table
| Compute Budget | Optimal N | Optimal D |
|---|---|---|
| 1 × 10¹⁹ FLOPs | ~0.3B | ~6B |
| 1 × 10²⁰ FLOPs | ~0.9B | ~18B |
| 1 × 10²¹ FLOPs | ~3B | ~58B |
| 1 × 10²² FLOPs | ~9B | ~180B |
| 5.76 × 10²³ FLOPs (Gopher/Chinchilla budget) | ~70B | ~1.4T |
Related Pages
See scaling-laws for the Kaplan power laws that Chinchilla updated, compute-flops for how to count FLOPs in practice, and training-data-curation for how the required tokens at scale are sourced and filtered.
Sources
- Hoffmann et al. (2022) — Training Compute-Optimal Large Language Models. NeurIPS 2022
- Kaplan et al. (2020) — Scaling Laws for Neural Language Models. arXiv
- Touvron et al. (2023) — LLaMA: Open and Efficient Foundation Language Models. arXiv
Frequently Asked Questions
How did Hoffmann et al. determine the compute-optimal training point?
They used three complementary methods: (1) training many models at fixed compute budgets while varying the model size, then fitting a power law to find optimal N; (2) fitting parametric loss functions L(N, D) to training runs and analytically minimizing under a compute constraint; (3) fitting individual loss curves to extrapolate optimal (N, D) pairs at many compute levels. All three methods converged on N_opt ∝ C^{0.5} and D_opt ∝ C^{0.5}, with approximately 20 tokens per parameter.
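Method (1) reduces to fitting a straight line in log–log space: the exponent a in N_opt ∝ C^a is the slope of log N against log C. A toy sketch of that fit on synthetic data (pure standard library; not the paper's actual runs):

```python
import math
import random

random.seed(0)

# Synthetic "optimal N at budget C" observations following N_opt ∝ C^0.5,
# with multiplicative noise standing in for estimation error.
budgets = [10 ** e for e in range(18, 25)]
n_opts = [(c / 120) ** 0.5 * math.exp(random.gauss(0, 0.05)) for c in budgets]

# Least-squares slope of log N against log C recovers the exponent a.
xs = [math.log(c) for c in budgets]
ys = [math.log(n) for n in n_opts]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(f"fitted exponent a ~ {slope:.3f}")  # close to 0.5
```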
Why is the '20 tokens per parameter' rule important for practitioners?
For any given compute budget C (in FLOPs), the rule N_opt ≈ (C/120)^{1/2} and D_opt ≈ 20·N_opt provides a simple recipe: choose a model with N parameters and train it on ~20N tokens. A practitioner with a specific compute budget can quickly estimate both optimal model size and required dataset size. This shifted the field toward training medium-sized models on more data rather than maximally scaling parameters alone.
Does the Chinchilla rule apply to inference-heavy deployments?
Chinchilla optimizes for training compute efficiency only. For inference-heavy deployments (many queries per model), a smaller model with lower inference cost may be preferable even if it required more tokens to train. Training on additional tokens beyond the Chinchilla-optimal point continues to improve the model (subject to data availability), and the marginal cost of extra training tokens may be justified if it enables using a smaller model at inference time.
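The training-versus-inference tradeoff can be made concrete with a back-of-envelope lifetime-FLOPs comparison, using ~6 FLOPs per parameter per training token and ~2 FLOPs per parameter per inference token. The model sizes and token counts below are illustrative assumptions, not figures from the paper:

```python
def lifetime_flops(n_params: float, train_tokens: float, inference_tokens: float) -> float:
    """Total FLOPs: ~6*N*D for training plus ~2*N per inference token served."""
    return 6 * n_params * train_tokens + 2 * n_params * inference_tokens

# Hypothetical comparison: a Chinchilla-optimal 70B model vs. a smaller
# 30B model overtrained well past its Chinchilla point.
served = 1e13  # assumed lifetime inference tokens for a heavily used deployment
big = lifetime_flops(70e9, 1.4e12, served)
small = lifetime_flops(30e9, 3.0e12, served)
print(f"70B total: {big:.2e} FLOPs   30B total: {small:.2e} FLOPs")
# At high query volume the smaller model's lower per-token inference cost
# outweighs its larger training bill.
```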