Chinchilla Scaling: Compute-Optimal Training and the 20-Token-Per-Parameter Rule
Chinchilla scaling (Hoffmann et al., 2022): compute-optimal training uses ~20 tokens per parameter; Chinchilla-70B (1.4T tokens) outperforms Gopher-280B (300B tokens) with the same training compute and 4× fewer parameters, showing prior large models were severely undertrained.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Compute-optimal tokens per parameter | ~20 | tokens/parameter | Derived from 400+ training runs; N_opt and D_opt both scale as C^{0.5} |
| Chinchilla model size | 70 | billion parameters | Compute-optimal for 5.76×10²³ FLOPs given 1.4T training tokens |
| Chinchilla training tokens | 1.4 | trillion tokens | ~20× N = 20 × 70B = 1.4T tokens; compute-optimal point |
| Chinchilla vs Gopher (280B, 300B tokens) | Chinchilla wins on most benchmarks | — | Same training compute but 4× fewer parameters; smaller but properly trained |
| Approach 1 (fixed compute, vary N) | N ∝ C^{0.49}, D ∝ C^{0.51} | scaling exponents | First estimation method from Hoffmann et al.; consistent across all three methods |
The Chinchilla paper (Hoffmann et al., 2022) established that the large language models preceding it were significantly undertrained — they had too many parameters relative to their training token count for the compute budget spent. This finding reshaped scaling strategy and introduced the “20 tokens per parameter” rule for compute-optimal training.
The Central Finding
Hoffmann et al. trained over 400 language models ranging from 70M to 16B parameters, varying the number of training tokens for each model size, all within controlled compute budgets. They found:
For compute budget C, the compute-optimal model has:
- N_opt ≈ (C / (6 × 20))^{0.5} parameters
- D_opt ≈ 20 × N_opt training tokens
The factor 6 is the total training FLOPs per parameter per token: ~2 for the forward pass plus ~4 for the backward pass. Thus C ≈ 6 × N × D; substituting D = 20 × N gives C ≈ 120 × N², and hence N_opt ≈ (C/120)^{0.5}.
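Under the approximation C ≈ 6 × N × D total training FLOPs with D = 20 × N, the compute-optimal point reduces to a one-line calculation. A minimal sketch (the function name is our own):

```python
def compute_optimal(c_flops: float, tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Return (N_opt, D_opt) for a training budget of c_flops FLOPs.

    Uses C ~= 6 * N * D total training FLOPs and D = tokens_per_param * N,
    so C ~= 120 * N^2 and N_opt = sqrt(C / 120).
    """
    n_opt = (c_flops / (6.0 * tokens_per_param)) ** 0.5
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt

# Gopher/Chinchilla budget of 5.76e23 FLOPs recovers roughly 70B params, 1.4T tokens
n, d = compute_optimal(5.76e23)
print(f"N_opt ~ {n / 1e9:.0f}B params, D_opt ~ {d / 1e12:.2f}T tokens")
# → N_opt ~ 69B params, D_opt ~ 1.39T tokens
```

That the simple 120 × N² rule lands almost exactly on Chinchilla's published 70B / 1.4T configuration is a useful sanity check on the approximation.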
Comparison: Kaplan vs Chinchilla Recommendations
| Recommendation | Model size N | Training tokens D | Compute C |
|---|---|---|---|
| Kaplan optimal (2020) | Grows fast with compute (N ∝ C^{0.73}) | Grows slowly (D ∝ C^{0.27}) | Fixed budget |
| Chinchilla optimal (2022) | N ∝ C^{0.5} | D ∝ C^{0.5}, i.e. ~20×N | Fixed budget |
Model Benchmarks: Chinchilla vs Contemporaries
| Model | Parameters | Tokens | Compute | Avg Benchmark |
|---|---|---|---|---|
| Gopher | 280B | 300B | 5.76×10²³ | 75.1% |
| GPT-3 | 175B | 300B | ~3.1×10²³ | 73.9% |
| Megatron-Turing | 530B | 270B | 9.9×10²³ | 74.5% |
| Chinchilla | 70B | 1.4T | 5.76×10²³ | 77.7% |
Chinchilla achieves the best performance using the same compute as Gopher but with 4× fewer parameters and 4.7× more training tokens.
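The compute column can be roughly cross-checked with the standard C ≈ 6 × N × D approximation. Note this slightly undercounts the paper's reported 5.76×10²³ for Gopher, which comes from a more exact FLOP count:

```python
# (parameters, training tokens) from the benchmark table above
models = {
    "Gopher":          (280e9, 300e9),
    "GPT-3":           (175e9, 300e9),
    "Megatron-Turing": (530e9, 270e9),
    "Chinchilla":      (70e9, 1.4e12),
}

for name, (n_params, n_tokens) in models.items():
    c = 6 * n_params * n_tokens  # approximate total training FLOPs
    print(f"{name:16s} ~ {c:.2e} FLOPs")

# Gopher ~ 5.04e23 and Chinchilla ~ 5.88e23: comparable training budgets,
# but Chinchilla spends them on 4x fewer parameters and ~4.7x more tokens.
```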
Compute-Optimal Points Table
| Compute Budget | Optimal N | Optimal D |
|---|---|---|
| 1 × 10¹⁹ FLOPs | ~0.3B | ~6B |
| 1 × 10²⁰ FLOPs | ~0.9B | ~18B |
| 1 × 10²¹ FLOPs | ~3B | ~58B |
| 1 × 10²² FLOPs | ~9B | ~180B |
| 5.76 × 10²³ FLOPs (Gopher/Chinchilla budget) | ~70B | ~1.4T |
Related Pages
See scaling-laws for the Kaplan power laws that Chinchilla updated, compute-flops for how to count FLOPs in practice, and training-data-curation for how the required tokens at scale are sourced and filtered.
Sources
- Hoffmann et al. (2022) — Training Compute-Optimal Large Language Models. NeurIPS 2022
- Kaplan et al. (2020) — Scaling Laws for Neural Language Models. arXiv
- Touvron et al. (2023) — LLaMA: Open and Efficient Foundation Language Models. arXiv
Frequently Asked Questions
How did Hoffmann et al. determine the compute-optimal training point?
They used three complementary methods: (1) training many models at fixed compute budgets while varying the model size, then fitting a power law to find optimal N; (2) fitting parametric loss functions L(N, D) to training runs and analytically minimizing under a compute constraint; (3) fitting individual loss curves to extrapolate optimal (N, D) pairs at many compute levels. All three methods converged on N_opt ∝ C^{0.5} and D_opt ∝ C^{0.5}, with approximately 20 tokens per parameter.
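Method (1) reduces to fitting a straight line in log–log space: the exponent a in N_opt ∝ C^a is the slope of log N against log C. A toy sketch of that fit on synthetic data (pure standard library; not the paper's actual runs):

```python
import math
import random

random.seed(0)

# Synthetic "optimal N at budget C" observations following N_opt ∝ C^0.5,
# with multiplicative noise standing in for estimation error.
budgets = [10 ** e for e in range(18, 25)]
n_opts = [(c / 120) ** 0.5 * math.exp(random.gauss(0, 0.05)) for c in budgets]

# Least-squares slope of log N against log C recovers the exponent a.
xs = [math.log(c) for c in budgets]
ys = [math.log(n) for n in n_opts]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(f"fitted exponent a ~ {slope:.3f}")  # close to 0.5
```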
Why is the '20 tokens per parameter' rule important for practitioners?
For any given compute budget C (in FLOPs), the rule N_opt ≈ (C/120)^{1/2} and D_opt ≈ 20·N_opt provides a simple recipe: choose a model with N parameters and train it on ~20N tokens. A practitioner with a specific compute budget can quickly estimate both optimal model size and required dataset size. This shifted the field toward training medium-sized models on more data rather than maximally scaling parameters alone.
Does the Chinchilla rule apply to inference-heavy deployments?
Chinchilla optimizes for training compute efficiency only. For inference-heavy deployments (many queries per model), a smaller model with lower inference cost may be preferable even if it required more tokens to train. Training on additional tokens beyond the Chinchilla-optimal point continues to improve the model (subject to data availability), and the marginal cost of extra training tokens may be justified if it enables using a smaller model at inference time.
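The training-versus-inference tradeoff can be made concrete with a back-of-envelope lifetime-FLOPs comparison, using ~6 FLOPs per parameter per training token and ~2 FLOPs per parameter per inference token. The model sizes and token counts below are illustrative assumptions, not figures from the paper:

```python
def lifetime_flops(n_params: float, train_tokens: float, inference_tokens: float) -> float:
    """Total FLOPs: ~6*N*D for training plus ~2*N per inference token served."""
    return 6 * n_params * train_tokens + 2 * n_params * inference_tokens

# Hypothetical comparison: a Chinchilla-optimal 70B model vs. a smaller
# 30B model overtrained well past its Chinchilla point.
served = 1e13  # assumed lifetime inference tokens for a heavily used deployment
big = lifetime_flops(70e9, 1.4e12, served)
small = lifetime_flops(30e9, 3.0e12, served)
print(f"70B total: {big:.2e} FLOPs   30B total: {small:.2e} FLOPs")
# At high query volume the smaller model's lower per-token inference cost
# outweighs its larger training bill.
```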