Next-Token Prediction: Causal Language Modeling Objective and Perplexity

Category: training · Updated: 2026-02-27

Causal language modeling maximizes log P(x) = Σₜ log P(x_t | x_{<t}); perplexity is PPL = exp(−(1/N) Σₜ log P(x_t | x_{<t})); the 117M-parameter GPT-2 achieved a perplexity of 35.1 on Penn Treebank without fine-tuning (Radford et al., 2019).

Key Data Points
  - Training objective: max Σₜ log P(x_t | x₁, …, x_{t−1}); equivalently, minimize the cross-entropy H(y, ŷ) = −Σᵢ yᵢ log ŷᵢ
  - Perplexity formula: PPL = exp(−(1/N) Σₜ log P(x_t | x_{<t})); the geometric mean of inverse token probabilities; lower is better
  - GPT-2 117M perplexity (Penn Treebank): 35.1 PPL; Radford et al. (2019), zero-shot with no fine-tuning; SOTA at the time was ~34 with fine-tuning
  - Context for causal mask: left-only; the attention mask sets the upper triangle to −∞ before softmax, so tokens cannot attend to future positions
  - Tokens per batch (GPT-3): 3.2 million tokens/batch; large batches reduce gradient variance; the 3.2M tokens are spread across sequences of 2,048 tokens

Next-token prediction (causal language modeling) is the training objective that transforms a transformer decoder into a language model. By training the model to predict each token from its preceding context, the model learns general-purpose language representations without any task-specific supervision.

The Objective

For a sequence of tokens (x₁, x₂, …, x_N), the language model objective is to maximize:

log P(x) = Σₜ₌₁ᴺ log P(x_t | x₁, x₂, …, x_{t-1})

Each P(x_t | x₁,…,x_{t-1}) is computed by:

  1. Running the causal transformer to obtain hidden state h_t at position t
  2. Projecting h_t through the output (unembedding) layer: logits = h_t · W_E^T
  3. Applying softmax to get a probability distribution over vocabulary

The loss is the sum of cross-entropies over all positions.
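The three steps and the summed loss can be sketched in plain NumPy. The helper names and the toy logits below are illustrative assumptions, not any particular framework's API; in practice the logits come from the transformer's unembedding projection:

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability before exponentiating.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_lm_loss(logits, targets):
    """Sum of per-position cross-entropies: -Sum_t log P(x_t | x_{<t}).

    logits:  (T, V) array, one row of vocabulary scores per position
    targets: (T,) array of correct next-token ids
    """
    probs = softmax(logits)
    # Probability the model assigned to each correct next token.
    token_probs = probs[np.arange(len(targets)), targets]
    return -np.log(token_probs).sum()

# Toy example: 3 positions, vocabulary of 4 tokens (made-up numbers).
logits = np.array([[2.0, 0.5, 0.1, -1.0],
                   [0.0, 3.0, 0.2, 0.1],
                   [1.0, 1.0, 1.0, 1.0]])
targets = np.array([0, 1, 2])
loss = causal_lm_loss(logits, targets)
```

Note that a position with a uniform distribution over V tokens contributes exactly log V to the loss, which is why a model no better than chance has perplexity equal to the vocabulary size.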

Causal Masking

The autoregressive property is enforced via a triangular attention mask:

|            | t=1 | t=2 | t=3 | t=4 |
| Position 1 |  ✓  |  ✗  |  ✗  |  ✗  |
| Position 2 |  ✓  |  ✓  |  ✗  |  ✗  |
| Position 3 |  ✓  |  ✓  |  ✓  |  ✗  |
| Position 4 |  ✓  |  ✓  |  ✓  |  ✓  |

Positions marked ✗ are set to −∞ before softmax, producing attention weight ≈ 0. This ensures that when computing the representation for position t, only tokens 1..t are visible.
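A minimal NumPy sketch of this masking, assuming raw attention scores are already computed (the function name and toy scores are hypothetical):

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to raw attention scores, then row-wise softmax.

    scores: (T, T) array; scores[i, j] is how strongly position i
    attends to position j before masking.
    """
    T = scores.shape[0]
    # Strict upper triangle (j > i) corresponds to future tokens.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Row-wise softmax; exp(-inf) = 0, so future positions get weight 0.
    z = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

weights = causal_attention_weights(np.random.randn(4, 4))
```

With uniform scores, position 1 attends only to itself (weight 1), position 2 splits attention 0.5/0.5, and so on down the triangle.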

Perplexity Benchmarks

| Model                             | Parameters | Penn Treebank PPL | Notes                               |
| 5-gram (Kneser-Ney)               | —          | 141               | Classic n-gram baseline             |
| LSTM (Merity et al., 2018)        | 33M        | 57.3              | State-of-the-art LSTM, fine-tuned   |
| GPT-2 117M                        | 117M       | 35.1              | Zero-shot; Radford et al. (2019)    |
| Transformer-XL (Dai et al., 2019) | 257M       | 21.8              | Recurrence for long-range context   |

Teacher Forcing

During training, the model receives the true tokens as input at each position (not its own predictions). This technique, called “teacher forcing,” provides stable training gradients — if the model makes a prediction error, subsequent positions still receive the correct context. At inference time, the model must use its own predictions autoregressively.
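In code, teacher forcing usually amounts to a one-token shift between inputs and targets. A sketch with made-up token ids:

```python
import numpy as np

# Token ids for a toy training sequence (hypothetical vocabulary).
tokens = np.array([17, 4, 99, 23, 5])

# Teacher forcing: the model always sees the TRUE previous tokens as
# input, and must predict the next true token at every position.
inputs = tokens[:-1]    # x_1 .. x_{T-1}
targets = tokens[1:]    # x_2 .. x_T   (shifted left by one)
```

This shift is why a single forward pass over a sequence yields a training signal at every position simultaneously, rather than one prediction per pass.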

See pre-training for the broader pre-training data and compute context, perplexity-metric for how perplexity is used for model evaluation, and temperature-sampling for how the probability distribution from next-token prediction is used to generate text.


Frequently Asked Questions

Why is next-token prediction an effective pre-training objective?

Next-token prediction is a general-purpose objective — to predict what comes next in text, a model must implicitly learn syntax, semantics, factual relationships, reasoning patterns, and conversational structure. The training signal is derived entirely from the text itself (no human labels needed), enabling training on internet-scale data. Radford et al. (2019) demonstrated that GPT-2 acquires diverse capabilities (summarization, translation, QA) purely from next-token prediction.

What is the relationship between cross-entropy loss and perplexity?

The average negative log-likelihood (NLL) per token is the cross-entropy loss: CE = −(1/N) Σ log P(x_t | context). Perplexity is the exponential of this: PPL = exp(CE). A model with perplexity 35 assigns the correct next token an average probability of approximately 1/35 ≈ 2.9%. Lower perplexity indicates the model is better calibrated to the data distribution. Comparing perplexities across models requires identical tokenization.
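The relationship is a one-liner; the example below feeds in per-token log-probabilities that all equal log(1/35) to recover perplexity 35 (the function name is illustrative):

```python
import numpy as np

def perplexity(log_probs):
    """PPL = exp(-(1/N) * sum of per-token log-probabilities)."""
    return float(np.exp(-np.mean(log_probs)))

# If every correct token gets probability 1/35, perplexity is 35.
ppl = perplexity(np.full(10, np.log(1 / 35)))
```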

Why can't the model attend to future tokens during training?

During pre-training with causal language modeling, the model must predict x_t using only x_1,...,x_{t-1}. If the model could attend to x_{t+1} when predicting x_t, the task becomes trivially easy — the answer is always in the context. The causal attention mask sets all attention weights from position t to positions > t to −∞ before softmax, effectively zeroing them out and enforcing this constraint. This mask also makes the architecture directly usable for autoregressive generation at inference time.
