Autoregressive Decoding: Greedy, Beam Search, and Sampling Strategies

Category: inference · Updated: 2026-02-27

Beam search (Sutskever et al., 2014) maintains k candidate sequences by cumulative log-probability; top-p nucleus sampling (Holtzman et al., 2020) dynamically selects the minimum token set covering probability mass p.

Key Data Points
| Measure | Value | Unit | Notes |
| --- | --- | --- | --- |
| Greedy decoding quality | Repetitive | — | Holtzman et al.: greedy and beam search produce degenerate repetitive text in open-ended generation |
| Nucleus sampling p value | 0.9–0.95 | cumulative probability | Holtzman et al. (2020): top-p=0.9 produces the most human-like text; filters low-probability tail |
| Top-k sampling k value | 40–50 | top tokens | Fan et al. (2018): k=40 balances diversity and coherence for story generation |
| Beam search width | 4–8 | beams | Machine translation: beam=4 typically near-optimal; larger beams decrease quality for open generation |
| Temperature effect | T→0 = greedy, T→∞ = uniform | — | Scaling logits by 1/T before softmax; T=0.7–1.0 standard for instruction following |

Autoregressive decoding is the inference procedure for generating text with language models. The model generates tokens sequentially: at each step, the full sequence of previously generated tokens is fed as input, and the model outputs a probability distribution over the vocabulary. A decoding strategy selects the next token from this distribution. The process repeats until an end-of-sequence token is sampled or a maximum length is reached.

The Core Loop

1. Input: prompt tokens [t_1, ..., t_n]
2. Forward pass → logits z ∈ ℝ^V (V = vocab size)
3. Apply decoding strategy → select t_{n+1}
4. Append t_{n+1} to context
5. Repeat from step 2
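
The five steps above can be sketched as a generic loop. Here `model`, `select`, and `eos_id` are hypothetical stand-ins for a real model's forward pass, a decoding strategy, and the end-of-sequence token id:

```python
def decode(model, prompt, select, eos_id, max_len=32):
    """Generic autoregressive loop.
    model:  token list -> logits over the vocabulary (hypothetical stand-in)
    select: logits -> chosen token id (the decoding strategy)"""
    tokens = list(prompt)
    while len(tokens) < max_len:
        logits = model(tokens)   # step 2: forward pass over the full context
        nxt = select(logits)     # step 3: apply decoding strategy
        tokens.append(nxt)       # step 4: append to context
        if nxt == eos_id:        # stop on end-of-sequence
            break
    return tokens                # step 5 is the loop itself
```

Every strategy below plugs into the same loop as a different `select` function; only the token-choice rule changes.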

Without caching, each forward pass recomputes every position: O(n · d_model²) for the FFN layers plus O(n² · d_model) for attention. A KV cache stores past keys and values so only the new token is processed, reducing the per-step cost to O(d_model²) for the FFN plus O(n · d_model) for attention — see kv-cache.

Decoding Strategies

Greedy Decoding

Always select the highest-probability token: t = argmax_v p(v|context)

  • Deterministic; no randomness
  • Maximizes token-level probability at each step
  • Does not maximize sequence-level probability
  • Produces repetitive text in open-ended settings
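
As a one-function sketch, greedy decoding is just an argmax over the logits:

```python
def greedy_select(logits):
    """Argmax over the vocabulary: always take the single best token.
    Ties resolve to the lowest index, so the choice is fully deterministic."""
    return max(range(len(logits)), key=logits.__getitem__)
```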

Beam Search

Maintain k candidate sequences (“beams”), scored by cumulative log-probability:

score(t_1, …, t_n) = Σ log p(t_i | t_1, …, t_{i-1})

At each step, expand each beam by all vocabulary tokens, keep the top-k sequences by score, and discard the rest.
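
A minimal sketch of this expand-and-prune step, assuming a hypothetical `model(tokens)` that returns a list of per-token log-probabilities:

```python
import math

def beam_search(model, prompt, k=4, steps=8):
    """Keep the k best sequences by cumulative log-probability.
    model: token list -> log-probabilities over the vocabulary (hypothetical)."""
    beams = [(list(prompt), 0.0)]            # (tokens, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            logprobs = model(tokens)
            for v, lp in enumerate(logprobs):   # expand by every vocab token
                candidates.append((tokens + [v], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]               # keep top-k, discard the rest
    return beams
```

Real implementations also retire beams that emit end-of-sequence and often apply length normalization to the score; both are omitted here for brevity.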

| Beam width | Compute cost | Quality (MT) | Quality (open-gen) |
| --- | --- | --- | --- |
| 1 (greedy) | ~1× | Good | Poor |
| 4 | ~4× | Best | Worse than sampling |
| 8 | ~8× | ~Equal to 4 | Degenerates |

For constrained tasks (translation, summarization) beam search often outperforms sampling. For open-ended generation, sampling dominates.

Temperature Sampling

Scale logits before softmax: p_T(v) = softmax(z / T)_v

| Temperature T | Distribution | Behavior |
| --- | --- | --- |
| T → 0 | Peaked (argmax) | Greedy; deterministic |
| T = 0.7 | Slightly peaked | Focused; good for factual tasks |
| T = 1.0 | Unchanged | Original model distribution |
| T = 1.5 | Flattened | More random; creative |
| T → ∞ | Uniform | Uniform random selection |
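
A minimal sketch of the scaling, operating on an explicit logit list:

```python
import math

def temperature_probs(logits, T=1.0):
    """Scale logits by 1/T before softmax; T < 1 sharpens the
    distribution toward argmax, T > 1 flattens it toward uniform."""
    scaled = [z / T for z in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```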

Top-k Sampling

Restrict the distribution to the k highest-probability tokens, renormalize, then sample:

p_k(v) ∝ p(v) if v ∈ top-k, else 0

Problem: k=50 may include very low-probability tokens when the distribution is peaked, or exclude many reasonable options when the distribution is flat.
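
The truncate-and-renormalize step can be sketched over an explicit probability list:

```python
def top_k_probs(probs, k):
    """Zero out everything outside the k most probable tokens, renormalize.
    The fixed cutoff k is the weakness noted above: it ignores the
    shape of the distribution."""
    keep = set(sorted(range(len(probs)),
                      key=probs.__getitem__, reverse=True)[:k])
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]
```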

Nucleus (Top-p) Sampling

Select the minimal set of tokens covering probability mass ≥ p:

V_p = argmin_{V' ⊆ V} |V'| s.t. Σ_{v ∈ V'} p(v) ≥ p

Sample from V_p with renormalized probabilities. The set size adapts dynamically to the distribution shape.

| p value | Behavior |
| --- | --- |
| p = 1.0 | Full distribution (no truncation) |
| p = 0.95 | Slight truncation of low-prob tail |
| p = 0.9 | Standard; Holtzman et al. recommendation |
| p = 0.5 | Aggressive truncation; focused but less diverse |
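
The selection rule above can be sketched as a greedy accumulation over tokens sorted by probability:

```python
def nucleus_probs(probs, p=0.9):
    """Smallest set of tokens whose cumulative probability >= p, renormalized.
    The nucleus size adapts to the distribution: peaked -> small, flat -> large."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    nucleus, mass = set(), 0.0
    for i in order:                # add tokens in descending probability
        nucleus.add(i)
        mass += probs[i]
        if mass >= p:              # stop as soon as the mass threshold is hit
            break
    filtered = [q if i in nucleus else 0.0 for i, q in enumerate(probs)]
    total = sum(filtered)
    return [q / total for q in filtered]
```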

Repetition Penalties

To reduce degenerate repetition, a repetition penalty modifies logits for recently generated tokens:

z'_v = z_v / penalty if z_v > 0, or z_v · penalty if z_v ≤ 0, for tokens v in the context; otherwise z'_v = z_v

With penalty > 1.0, previously generated tokens are de-emphasized. Typical values: 1.1–1.3.
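
One common formulation divides positive logits and multiplies negative ones, so the penalty always pushes a repeated token's logit down; a sketch:

```python
def penalize_repeats(logits, context, penalty=1.2):
    """Lower the logits of tokens already generated. Dividing a positive
    logit and multiplying a negative one both move it toward -inf, so
    repeats lose probability regardless of sign."""
    seen = set(context)
    out = []
    for v, z in enumerate(logits):
        if v in seen:
            out.append(z / penalty if z > 0 else z * penalty)
        else:
            out.append(z)
    return out
```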

See kv-cache for how inference is made efficient via caching, speculative-decoding for a technique that accelerates autoregressive generation, and softmax-function for the probability computation at each decoding step.


Frequently Asked Questions

Why does beam search produce worse open-ended text than sampling?

Holtzman et al. (2020) showed that beam search maximizes the probability of the entire sequence, but high-probability sequences are not the same as human-like sequences. Human text is not drawn from the maximum-probability mode of the distribution — humans write surprising, diverse, and contextually varied text. Beam search collapses to repetitive high-probability phrases because these dominate the probability mass. Sampling methods that truncate the distribution (top-p, top-k) better match the diversity of natural human text.

What is top-p (nucleus) sampling and how does it differ from top-k?

Top-k sampling selects from the k highest-probability tokens, regardless of how much probability mass they cover. Top-p (nucleus) sampling selects the minimum set of tokens whose cumulative probability ≥ p. At each step, the token set size varies dynamically: if one token has 95% probability, the nucleus has size 1; if 100 tokens each have 1% probability, the nucleus has size ~95. This dynamic size is the key advantage: top-p adapts to the model's certainty at each position, while top-k uses a fixed cutoff regardless of distribution shape.
