Chinchilla Scaling: Compute-Optimal Training and the 20-Token-Per-Parameter Rule

Category: training · Updated: 2026-02-27

Chinchilla scaling (Hoffmann et al., 2022): compute-optimal training uses ~20 tokens per parameter; Chinchilla-70B (1.4T tokens) outperforms Gopher-280B (300B tokens) at the same compute budget with 4× fewer parameters, showing prior large models were severely undertrained.

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Compute-optimal tokens per parameter | ~20 | tokens/parameter | Derived from 400+ training runs; N_opt and D_opt both scale as C^{0.5} |
| Chinchilla model size | 70 | billion parameters | Compute-optimal for 5.76×10²³ FLOPs given 1.4T training tokens |
| Chinchilla training tokens | 1.4 | trillion tokens | ~20 × N = 20 × 70B = 1.4T tokens; compute-optimal point |
| Chinchilla vs Gopher (280B, 300B tokens) | Chinchilla wins on most benchmarks | — | Despite 4× fewer parameters at the same compute budget; smaller but properly trained |
| Approach 1 (fixed compute, vary N) | N ∝ C^{0.49}, D ∝ C^{0.51} | — | First estimation method from Hoffmann et al.; consistent across all three methods |

The Chinchilla paper (Hoffmann et al., 2022) established that the large language models preceding it were significantly undertrained — they had too many parameters relative to their training token count for the compute budget spent. This finding reshaped scaling strategy and introduced the “20 tokens per parameter” rule for compute-optimal training.

The Central Finding

Hoffmann et al. trained over 400 language models ranging from 70M to 16B parameters, varying the number of training tokens for each model size, all within controlled compute budgets. They found:

For compute budget C, the compute-optimal model has:

  • N_opt ≈ (C / (6 × 20))^{0.5} parameters
  • D_opt ≈ 20 × N_opt training tokens

The factor 6 is the approximate total training FLOPs per parameter per token (≈2 for the forward pass plus ≈4 for the backward pass), and 20 is the tokens-per-parameter ratio, so C ≈ 6·N·D = 120·N_opt².
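This rule can be applied directly in code. A minimal sketch (the function name and defaults are illustrative, not from the paper):

```python
def compute_optimal(c_flops, flops_per_param_token=6, tokens_per_param=20):
    """Solve C = 6*N*D with D = 20*N for the compute-optimal N and D.

    Substituting D = 20*N gives C = 120*N**2, so N_opt = sqrt(C / 120).
    """
    n_opt = (c_flops / (flops_per_param_token * tokens_per_param)) ** 0.5
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt

# The Gopher/Chinchilla budget of 5.76e23 FLOPs recovers ~70B params, ~1.4T tokens.
n, d = compute_optimal(5.76e23)
print(f"N_opt ≈ {n / 1e9:.0f}B params, D_opt ≈ {d / 1e12:.2f}T tokens")
```

Note that the result is only as good as the C ≈ 6·N·D approximation, which ignores attention FLOPs and embedding parameters.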

Comparison: Kaplan vs Chinchilla Recommendations

| Recommendation | Model size | Tokens | Compute |
|---|---|---|---|
| Kaplan optimal (2020) | Large N, limited D | Few | C fixed |
| Chinchilla optimal (2022) | N and D scaled equally | ~20×N | C fixed |
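The practical gap between the two recommendations grows with the budget. The sketch below is purely illustrative: the anchor point (C₀ = 10¹⁹ FLOPs, N₀ = 0.3B params) is an assumption chosen so both curves start from the same place, and the Kaplan exponent N ∝ C^{0.73} is from the 2020 paper:

```python
# Illustrative divergence of Kaplan (N ~ C^0.73) vs Chinchilla (N ~ C^0.5)
# recommendations. The anchor (C0, N0) is assumed for illustration only.
C0, N0 = 1e19, 0.3e9

def kaplan_n(c):
    return N0 * (c / C0) ** 0.73   # exponent from Kaplan et al. (2020)

def chinchilla_n(c):
    return N0 * (c / C0) ** 0.5    # exponent from Hoffmann et al. (2022)

c = 5.76e23  # the Gopher/Chinchilla budget
print(f"Kaplan-style recommendation:     ~{kaplan_n(c) / 1e9:.0f}B params")
print(f"Chinchilla-style recommendation: ~{chinchilla_n(c) / 1e9:.0f}B params")
```

At that budget the Kaplan exponent recommends a model roughly an order of magnitude larger, with correspondingly fewer tokens: exactly the "large N, limited D" pattern in the table above.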

Model Benchmarks: Chinchilla vs Contemporaries

| Model | Parameters | Tokens | Compute (FLOPs) | Avg Benchmark |
|---|---|---|---|---|
| Gopher | 280B | 300B | 5.76×10²³ | 75.1% |
| GPT-3 | 175B | 300B | ~3.1×10²³ | 73.9% |
| Megatron-Turing NLG | 530B | 270B | 9.9×10²³ | 74.5% |
| Chinchilla | 70B | 1.4T | 5.76×10²³ | 77.7% |

Chinchilla achieves the best performance using the same compute as Gopher but with 4× fewer parameters and 4.7× more training tokens.
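The C ≈ 6·N·D rule gives a quick sanity check on these budgets. The estimates below only roughly match the reported figures (e.g. it yields ~5.0×10²³ for Gopher against the reported 5.76×10²³, since reported budgets include costs the approximation ignores):

```python
# Rough training-compute estimate C ≈ 6*N*D for models in the table above.
models = {
    "Gopher":     (280e9, 300e9),
    "GPT-3":      (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}
estimates = {name: 6 * n * d for name, (n, d) in models.items()}
for name, flops in estimates.items():
    print(f"{name:10s} ≈ {flops:.2e} training FLOPs")
```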

Compute-Optimal Points Table

Applying N_opt ≈ (C/120)^{1/2} and D_opt = 20 × N_opt (values rounded):

| Compute Budget | Optimal N | Optimal D |
|---|---|---|
| 1 × 10¹⁹ FLOPs | ~0.3B | ~6B |
| 1 × 10²⁰ FLOPs | ~0.9B | ~18B |
| 1 × 10²¹ FLOPs | ~3B | ~58B |
| 1 × 10²² FLOPs | ~9B | ~180B |
| 5.76 × 10²³ FLOPs (Chinchilla/Gopher budget) | ~70B | ~1.4T |
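A short loop applies the closed form N_opt = (C/120)^{1/2}, D_opt = 20·N_opt across budgets (variable names are illustrative):

```python
# Sweep compute budgets and print the compute-optimal (N, D) pair for each,
# using C = 6*N*D with D = 20*N, i.e. N_opt = sqrt(C / 120).
rows = []
for exp in range(19, 23):
    c = 10.0 ** exp
    n_opt = (c / 120) ** 0.5
    rows.append((c, n_opt, 20 * n_opt))
    print(f"1e{exp} FLOPs: N ≈ {n_opt / 1e9:.1f}B, D ≈ {20 * n_opt / 1e9:.0f}B")
```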

See scaling-laws for the Kaplan power laws that Chinchilla updated, compute-flops for how to count FLOPs in practice, and training-data-curation for how the required tokens at scale are sourced and filtered.


Frequently Asked Questions

How did Hoffmann et al. determine the compute-optimal training point?

They used three complementary methods: (1) training many models at fixed compute budgets while varying the model size, then fitting a power law to find optimal N; (2) fitting parametric loss functions L(N, D) to training runs and analytically minimizing under a compute constraint; (3) fitting individual loss curves to extrapolate optimal (N, D) pairs at many compute levels. All three methods converged on N_opt ∝ C^{0.5} and D_opt ∝ C^{0.5}, with approximately 20 tokens per parameter.

Why is the '20 tokens per parameter' rule important for practitioners?

For any given compute budget C (in FLOPs), the rule N_opt ≈ (C/(6·20))^{1/2} and D_opt ≈ 20·N_opt provides a simple recipe: choose a model with N parameters, train it on ~20N tokens. A practitioner with a specific compute budget can quickly estimate both the optimal model size and the required dataset size. This shifted the field toward training medium-sized models on more data rather than maximally scaling parameters alone.
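The recipe also runs in reverse: given a target model size, it implies the token count and compute budget needed to train it compute-optimally. A sketch (the function name and the 7B example are hypothetical):

```python
def budget_for_model(n_params, tokens_per_param=20):
    """Invert the recipe: for a target model size N, return the
    compute-optimal token count D = 20*N and the implied budget C ≈ 6*N*D."""
    d = tokens_per_param * n_params
    return d, 6 * n_params * d

# e.g. a hypothetical 7B-parameter target
tokens, flops = budget_for_model(7e9)
print(f"train on ~{tokens / 1e9:.0f}B tokens, ~{flops:.2e} FLOPs")
```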

Does the Chinchilla rule apply to inference-heavy deployments?

Chinchilla optimizes for training compute efficiency only. For inference-heavy deployments (many queries per model), a smaller model with lower inference cost may be preferable even if it required more tokens to train. Training on additional tokens beyond the Chinchilla-optimal point continues to improve the model (subject to data availability), and the marginal cost of extra training tokens may be justified if it enables using a smaller model at inference time.
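The training-plus-inference tradeoff can be sketched with the same FLOP approximations (~6·N per training token, ~2·N per inference token). All workload numbers below are hypothetical, and the sketch assumes the smaller, longer-trained model reaches acceptable quality, which Chinchilla's analysis does not guarantee:

```python
def lifetime_flops(n_params, train_tokens, inference_tokens):
    """Total cost: training ~6*N per token plus inference ~2*N per token."""
    return 6 * n_params * train_tokens + 2 * n_params * inference_tokens

# Hypothetical deployment serving 5T lifetime inference tokens.
served = 5e12
chinchilla_70b = lifetime_flops(70e9, 1.4e12, served)   # Chinchilla-optimal
overtrained_35b = lifetime_flops(35e9, 3.0e12, served)  # hypothetical smaller model
print(f"70B Chinchilla-optimal: {chinchilla_70b:.2e} total FLOPs")
print(f"35B overtrained:        {overtrained_35b:.2e} total FLOPs")
```

Under this workload the smaller, over-trained model costs fewer lifetime FLOPs despite the extra training tokens, which is the tradeoff the paragraph above describes.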
