Next-Token Prediction: Causal Language Modeling Objective and Perplexity

Category: training · Updated: 2026-02-27

Causal language modeling maximizes log P(x) = Σₜ log P(x_t | x_{<t}); perplexity is PPL = exp(−(1/N) Σₜ log P(x_t | x_{<t})); the 117M-parameter GPT-2 achieved a perplexity of 35.1 on Penn Treebank without fine-tuning (Radford et al., 2019).

Key Data Points
  - Training objective: max Σₜ log P(x_t | x₁, …, x_{t−1}); equivalently, minimize the cross-entropy H(y, ŷ) = −Σᵢ yᵢ log ŷᵢ
  - Perplexity formula: PPL = exp(−(1/N) Σₜ log P(x_t | x_{<t})); the geometric mean of inverse token probabilities; lower is better
  - GPT-2 117M perplexity (Penn Treebank): 35.1 PPL; Radford et al. (2019), zero-shot with no fine-tuning; SOTA at the time was ~34 with fine-tuning
  - Context for causal mask: left-only; the attention mask sets the upper triangle to −∞ before softmax, so tokens cannot attend to future positions
  - Tokens per batch (GPT-3): 3.2 million tokens/batch; large batches reduce gradient variance; the 3.2M tokens are spread across sequences of 2,048 tokens

Next-token prediction (causal language modeling) is the training objective that transforms a transformer decoder into a language model. By training the model to predict each token from its preceding context, the model learns general-purpose language representations without any task-specific supervision.

The Objective

For a sequence of tokens (x₁, x₂, …, x_N), the language model objective is to maximize:

log P(x) = Σₜ₌₁ᴺ log P(x_t | x₁, x₂, …, x_{t-1})

Each P(x_t | x₁,…,x_{t-1}) is computed by:

  1. Running the causal transformer to obtain hidden state h_t at position t
  2. Projecting h_t through the output (unembedding) layer: logits = h_t · W_E^T
  3. Applying softmax to get a probability distribution over vocabulary

The loss is the sum of cross-entropies over all positions.
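The three steps and the summed loss can be sketched in plain NumPy. The helper names and the toy logits below are illustrative assumptions, not any particular framework's API; in practice the logits come from the transformer's unembedding projection:

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability before exponentiating.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_lm_loss(logits, targets):
    """Sum of per-position cross-entropies: -Sum_t log P(x_t | x_{<t}).

    logits:  (T, V) array, one row of vocabulary scores per position
    targets: (T,) array of correct next-token ids
    """
    probs = softmax(logits)
    # Probability the model assigned to each correct next token.
    token_probs = probs[np.arange(len(targets)), targets]
    return -np.log(token_probs).sum()

# Toy example: 3 positions, vocabulary of 4 tokens (made-up numbers).
logits = np.array([[2.0, 0.5, 0.1, -1.0],
                   [0.0, 3.0, 0.2, 0.1],
                   [1.0, 1.0, 1.0, 1.0]])
targets = np.array([0, 1, 2])
loss = causal_lm_loss(logits, targets)
```

Note that a position with a uniform distribution over V tokens contributes exactly log V to the loss, which is why a model no better than chance has perplexity equal to the vocabulary size.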

Causal Masking

The autoregressive property is enforced via a triangular attention mask:

|            | t=1 | t=2 | t=3 | t=4 |
| Position 1 |  ✓  |  ✗  |  ✗  |  ✗  |
| Position 2 |  ✓  |  ✓  |  ✗  |  ✗  |
| Position 3 |  ✓  |  ✓  |  ✓  |  ✗  |
| Position 4 |  ✓  |  ✓  |  ✓  |  ✓  |

Positions marked ✗ are set to −∞ before softmax, producing attention weight ≈ 0. This ensures that when computing the representation for position t, only tokens 1..t are visible.
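A minimal NumPy sketch of this masking, assuming raw attention scores are already computed (the function name and toy scores are hypothetical):

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to raw attention scores, then row-wise softmax.

    scores: (T, T) array; scores[i, j] is how strongly position i
    attends to position j before masking.
    """
    T = scores.shape[0]
    # Strict upper triangle (j > i) corresponds to future tokens.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Row-wise softmax; exp(-inf) = 0, so future positions get weight 0.
    z = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

weights = causal_attention_weights(np.random.randn(4, 4))
```

With uniform scores, position 1 attends only to itself (weight 1), position 2 splits attention 0.5/0.5, and so on down the triangle.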

Perplexity Benchmarks

| Model                             | Parameters | Penn Treebank PPL | Notes                               |
| 5-gram (Kneser-Ney)               | —          | 141               | Classic n-gram baseline             |
| LSTM (Merity et al., 2018)        | 33M        | 57.3              | State-of-the-art LSTM, fine-tuned   |
| GPT-2 117M                        | 117M       | 35.1              | Zero-shot; Radford et al. (2019)    |
| Transformer-XL (Dai et al., 2019) | 257M       | 21.8              | Recurrence for long-range context   |

Teacher Forcing

During training, the model receives the true tokens as input at each position (not its own predictions). This technique, called “teacher forcing,” provides stable training gradients — if the model makes a prediction error, subsequent positions still receive the correct context. At inference time, the model must use its own predictions autoregressively.
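In code, teacher forcing usually amounts to a one-token shift between inputs and targets. A sketch with made-up token ids:

```python
import numpy as np

# Token ids for a toy training sequence (hypothetical vocabulary).
tokens = np.array([17, 4, 99, 23, 5])

# Teacher forcing: the model always sees the TRUE previous tokens as
# input, and must predict the next true token at every position.
inputs = tokens[:-1]    # x_1 .. x_{T-1}
targets = tokens[1:]    # x_2 .. x_T   (shifted left by one)
```

This shift is why a single forward pass over a sequence yields a training signal at every position simultaneously, rather than one prediction per pass.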

See pre-training for the broader pre-training data and compute context, perplexity-metric for how perplexity is used for model evaluation, and temperature-sampling for how the probability distribution from next-token prediction is used to generate text.


Frequently Asked Questions

Why is next-token prediction an effective pre-training objective?

Next-token prediction is a general-purpose objective — to predict what comes next in text, a model must implicitly learn syntax, semantics, factual relationships, reasoning patterns, and conversational structure. The training signal is derived entirely from the text itself (no human labels needed), enabling training on internet-scale data. Radford et al. (2019) demonstrated that GPT-2 acquires diverse capabilities (summarization, translation, QA) purely from next-token prediction.

What is the relationship between cross-entropy loss and perplexity?

The average negative log-likelihood (NLL) per token is the cross-entropy loss: CE = −(1/N) Σ log P(x_t | context). Perplexity is the exponential of this: PPL = exp(CE). A model with perplexity 35 assigns the correct next token an average probability of approximately 1/35 ≈ 2.9%. Lower perplexity indicates the model is better calibrated to the data distribution. Comparing perplexities across models requires identical tokenization.
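The relationship is a one-liner; the example below feeds in per-token log-probabilities that all equal log(1/35) to recover perplexity 35 (the function name is illustrative):

```python
import numpy as np

def perplexity(log_probs):
    """PPL = exp(-(1/N) * sum of per-token log-probabilities)."""
    return float(np.exp(-np.mean(log_probs)))

# If every correct token gets probability 1/35, perplexity is 35.
ppl = perplexity(np.full(10, np.log(1 / 35)))
```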

Why can't the model attend to future tokens during training?

During pre-training with causal language modeling, the model must predict x_t using only x_1,...,x_{t-1}. If the model could attend to x_{t+1} when predicting x_t, the task becomes trivially easy — the answer is always in the context. The causal attention mask sets all attention weights from position t to positions > t to −∞ before softmax, effectively zeroing them out and enforcing this constraint. This mask also makes the architecture directly usable for autoregressive generation at inference time.
