Autoregressive Decoding: Greedy, Beam Search, and Sampling Strategies
Beam search (Sutskever et al., 2014) maintains k candidate sequences ranked by cumulative log-probability; top-p nucleus sampling (Holtzman et al., 2020) dynamically selects the minimum token set covering probability mass p.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Greedy decoding quality | Repetitive | — | Holtzman et al. (2020): greedy and beam search produce degenerate repetitive text in open-ended generation |
| Nucleus sampling p value | 0.9–0.95 | cumulative probability | Holtzman et al. (2020): top-p=0.9 produces the most human-like text; filters low-probability tail |
| Top-k sampling k value | 40–50 | top tokens | Fan et al. (2018): k=40 balances diversity and coherence for story generation |
| Beam search width | 4–8 | beams | Machine translation: beam=4 typically near-optimal; larger beams decrease quality for open generation |
| Temperature effect | T→0 = greedy, T→∞ = uniform | — | Scale logits by 1/T before softmax; T=0.7–1.0 standard for instruction following |
Autoregressive decoding is the inference procedure for generating text with language models. The model generates tokens sequentially: at each step, the full sequence of previously generated tokens is fed as input, and the model outputs a probability distribution over the vocabulary. A decoding strategy selects the next token from this distribution. The process repeats until an end-of-sequence token is sampled or a maximum length is reached.
The Core Loop
1. Input: prompt tokens [t_1, ..., t_n]
2. Forward pass → logits z ∈ ℝ^V (V = vocab size)
3. Apply decoding strategy → select t_{n+1}
4. Append t_{n+1} to context
5. Repeat from step 2
Without a KV cache, each forward pass over an n-token context costs O(n² · d_model) for attention plus O(n · d_model²) for the FFN layers; with a KV cache, each new token costs only O(n · d_model) for attention plus O(d_model²) for the FFN — see kv-cache.
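The five-step loop can be sketched in plain Python. `model_logits` is a hypothetical stand-in for a real model's forward pass, and `strategy` is any of the decoding rules described below:

```python
def decode(model_logits, strategy, prompt, eos_id, max_len):
    """Generic autoregressive loop: steps 2-5 above, repeated until EOS or max_len."""
    tokens = list(prompt)
    while len(tokens) < max_len:
        logits = model_logits(tokens)   # step 2: forward pass over the full context
        nxt = strategy(logits)          # step 3: decoding strategy selects a token
        tokens.append(nxt)              # step 4: extend the context
        if nxt == eos_id:               # stop on end-of-sequence
            break
    return tokens

def greedy(logits):
    """Greedy strategy: argmax over the vocabulary."""
    return max(range(len(logits)), key=logits.__getitem__)
```

Swapping the `strategy` callable is the only change needed to move between greedy, temperature, top-k, or top-p decoding.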
Decoding Strategies
Greedy Decoding
Always select the highest-probability token: t = argmax_v p(v|context)
- Deterministic; no randomness
- Maximizes token-level probability at each step
- Does not maximize sequence-level probability
- Produces repetitive text in open-ended settings
Beam Search
Maintain k candidate sequences (“beams”), scored by cumulative log-probability:
score(t_1, …, t_n) = Σ log p(t_i | t_1, …, t_{i-1})
At each step, expand each beam by all vocabulary tokens, keep the top-k sequences by score, and discard the rest.
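A minimal expand-and-prune sketch of this step, again assuming a hypothetical `model_logits` callable in place of a real forward pass:

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def beam_search(model_logits, prompt, k, steps):
    """Keep the k best sequences by cumulative log-probability."""
    beams = [(0.0, list(prompt))]                      # (score, tokens)
    for _ in range(steps):
        candidates = []
        for score, toks in beams:
            for v, lp in enumerate(log_softmax(model_logits(toks))):
                candidates.append((score + lp, toks + [v]))
        # prune: keep only the top-k expansions across all beams
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    return beams
```

In practice, finished beams (those that emitted end-of-sequence) are set aside and length normalization is often applied to the scores; both are omitted here for brevity.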
| Beam width | Compute cost | Quality (MT) | Quality (open-gen) |
|---|---|---|---|
| 1 (greedy) | 1× | Good | Poor |
| 4 | 4× | Best | Worse than sampling |
| 8 | 8× | ~Equal to 4 | Degenerates |
For constrained tasks (translation, summarization) beam search often outperforms sampling. For open-ended generation, sampling dominates.
Temperature Sampling
Scale logits before softmax: p_T(v) = softmax(z / T)
| Temperature T | Distribution | Behavior |
|---|---|---|
| T → 0 | Peaked (argmax) | Greedy; deterministic |
| T = 0.7 | Slightly peaked | Focused; good for factual tasks |
| T = 1.0 | Unchanged | Original model distribution |
| T = 1.5 | Flattened | More random; creative |
| T → ∞ | Uniform | Uniform random selection |
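A minimal sketch of temperature sampling, drawing from the scaled softmax by inverse-CDF (pure Python, no external dependencies):

```python
import math, random

def sample_with_temperature(logits, T, rng=random):
    """Divide logits by T, softmax, then sample from the resulting distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    r, acc = rng.random(), 0.0
    for v, e in enumerate(exps):                 # inverse-CDF sampling
        acc += e / total
        if r < acc:
            return v
    return len(logits) - 1                       # guard against rounding
```

As T shrinks, the scaled distribution concentrates on the argmax token, recovering greedy decoding in the limit.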
Top-k Sampling
Restrict the distribution to the k highest-probability tokens, renormalize, then sample:
p_k(v) ∝ p(v) if v ∈ top-k, else 0
Problem: k=50 may include very low-probability tokens when the distribution is peaked, or exclude many reasonable options when the distribution is flat.
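The filtering step itself is a short pass over the logits; ties at the k-th value are kept, which can slightly enlarge the set:

```python
def top_k_filter(logits, k):
    """Mask everything outside the k largest logits to -inf (softmax then gives them probability 0)."""
    kth = sorted(logits, reverse=True)[k - 1]    # value of the k-th largest logit
    return [z if z >= kth else float("-inf") for z in logits]
```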
Nucleus (Top-p) Sampling
Select the minimal set of tokens covering probability mass ≥ p:
V_p = argmin_{V’ ⊆ V} |V’| s.t. Σ_{v ∈ V’} p(v) ≥ p
Sample from V_p with renormalized probabilities. The set size adapts dynamically to the distribution shape.
| p value | Behavior |
|---|---|
| p = 1.0 | Full distribution (no truncation) |
| p = 0.95 | Slight truncation of low-prob tail |
| p = 0.9 | Standard; Holtzman et al. recommendation |
| p = 0.5 | Aggressive truncation; focused but less diverse |
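A sketch of the nucleus selection, operating on an already-normalized probability vector:

```python
def nucleus_filter(probs, p):
    """Keep the smallest prefix of tokens (sorted by probability) whose mass >= p; renormalize."""
    order = sorted(range(len(probs)), key=lambda v: probs[v], reverse=True)
    keep, mass = [], 0.0
    for v in order:
        keep.append(v)
        mass += probs[v]
        if mass >= p:                  # minimal set covering probability mass p
            break
    out = [0.0] * len(probs)
    for v in keep:
        out[v] = probs[v] / mass       # renormalize over the nucleus
    return out
```

Note how the kept-set size is determined by the distribution itself: a peaked distribution yields a tiny nucleus, a flat one a large nucleus.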
Repetition Penalties
To reduce degenerate repetition, a repetition penalty modifies logits for recently generated tokens:
z’_v = z_v / penalty if z_v > 0, else z_v · penalty, for each token v appearing in the context
With penalty > 1.0, previously generated tokens are de-emphasized; the sign split matters because dividing a negative logit would raise its probability rather than lower it. Typical values: 1.1–1.3.
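A common sign-aware variant divides positive logits and multiplies negative ones, so repeated tokens are always de-emphasized (a sketch, not any particular library's implementation):

```python
def apply_repetition_penalty(logits, context, penalty=1.2):
    """De-emphasize tokens already generated: divide positive logits, multiply negative ones."""
    out = list(logits)
    for v in set(context):
        out[v] = out[v] / penalty if out[v] > 0 else out[v] * penalty
    return out
```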
Related Pages
See kv-cache for how inference is made efficient via caching, speculative-decoding for a technique that accelerates autoregressive generation, and softmax-function for the probability computation at each decoding step.
Sources
- Holtzman et al. (2020) — The Curious Case of Neural Text Degeneration. ICLR 2020
- Sutskever et al. (2014) — Sequence to Sequence Learning with Neural Networks. NeurIPS 2014
- Fan et al. (2018) — Hierarchical Neural Story Generation. ACL 2018
Frequently Asked Questions
Why does beam search produce worse open-ended text than sampling?
Holtzman et al. (2020) showed that beam search maximizes the probability of the entire sequence, but high-probability sequences are not the same as human-like sequences. Human text is not drawn from the maximum-probability mode of the distribution — humans write surprising, diverse, and contextually varied text. Beam search collapses to repetitive high-probability phrases because these dominate the probability mass. Sampling methods that truncate the distribution (top-p, top-k) better match the diversity of natural human text.
What is top-p (nucleus) sampling and how does it differ from top-k?
Top-k sampling selects from the k highest-probability tokens, regardless of how much probability mass they cover. Top-p (nucleus) sampling selects the minimum set of tokens whose cumulative probability ≥ p. At each step, the token set size varies dynamically: if one token has 95% probability, the nucleus has size 1; if 100 tokens each have 1% probability, a nucleus with p = 0.9 has size 90. This dynamic size is the key advantage: top-p adapts to the model's certainty at each position, while top-k uses a fixed cutoff regardless of distribution shape.