Autoregressive Decoding: Greedy, Beam Search, and Sampling Strategies
Beam search (Sutskever et al., 2014) maintains k candidate sequences ranked by cumulative log-probability; top-p nucleus sampling (Holtzman et al., 2020) dynamically selects the minimum token set covering probability mass p.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Greedy decoding quality | Repetitive | — | Holtzman et al. (2020): greedy and beam search produce degenerate repetitive text in open-ended generation |
| Nucleus sampling p value | 0.9–0.95 | cumulative probability | Holtzman et al. (2020): top-p=0.9 produces the most human-like text; filters low-probability tail |
| Top-k sampling k value | 40–50 | top tokens | Fan et al. (2018): k=40 balances diversity and coherence for story generation |
| Beam search width | 4–8 | beams | Machine translation: beam=4 typically near-optimal; larger beams decrease quality for open generation |
| Temperature effect | T→0 = greedy, T→∞ = uniform | — | Scale logits by 1/T before softmax; T=0.7–1.0 standard for instruction following |
Autoregressive decoding is the inference procedure for generating text with language models. The model generates tokens sequentially: at each step, the full sequence of previously generated tokens is fed as input, and the model outputs a probability distribution over the vocabulary. A decoding strategy selects the next token from this distribution. The process repeats until an end-of-sequence token is sampled or a maximum length is reached.
The Core Loop
1. Input: prompt tokens [t_1, ..., t_n]
2. Forward pass → logits z ∈ ℝ^V (V = vocab size)
3. Apply decoding strategy → select t_{n+1}
4. Append t_{n+1} to context
5. Repeat from step 2
Without a KV cache, each forward pass over an n-token context costs O(n² · d_model) for attention plus O(n · d_model²) for the FFN layers; with a KV cache, each new token costs only O(n · d_model) for attention plus O(d_model²) for the FFN — see kv-cache.
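The five-step loop can be sketched in plain Python. `model_logits` is a hypothetical stand-in for a real model's forward pass, and `strategy` is any of the decoding rules described below:

```python
def decode(model_logits, strategy, prompt, eos_id, max_len):
    """Generic autoregressive loop: steps 2-5 above, repeated until EOS or max_len."""
    tokens = list(prompt)
    while len(tokens) < max_len:
        logits = model_logits(tokens)   # step 2: forward pass over the full context
        nxt = strategy(logits)          # step 3: decoding strategy selects a token
        tokens.append(nxt)              # step 4: extend the context
        if nxt == eos_id:               # stop on end-of-sequence
            break
    return tokens

def greedy(logits):
    """Greedy strategy: argmax over the vocabulary."""
    return max(range(len(logits)), key=logits.__getitem__)
```

Swapping the `strategy` callable is the only change needed to move between greedy, temperature, top-k, or top-p decoding.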
Decoding Strategies
Greedy Decoding
Always select the highest-probability token: t = argmax_v p(v|context)
- Deterministic; no randomness
- Maximizes token-level probability at each step
- Does not maximize sequence-level probability
- Produces repetitive text in open-ended settings
Beam Search
Maintain k candidate sequences (“beams”), scored by cumulative log-probability:
score(t_1, …, t_n) = Σ log p(t_i | t_1, …, t_{i-1})
At each step, expand each beam by all vocabulary tokens, keep the top-k sequences by score, and discard the rest.
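A minimal expand-and-prune sketch of this step, again assuming a hypothetical `model_logits` callable in place of a real forward pass:

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def beam_search(model_logits, prompt, k, steps):
    """Keep the k best sequences by cumulative log-probability."""
    beams = [(0.0, list(prompt))]                      # (score, tokens)
    for _ in range(steps):
        candidates = []
        for score, toks in beams:
            for v, lp in enumerate(log_softmax(model_logits(toks))):
                candidates.append((score + lp, toks + [v]))
        # prune: keep only the top-k expansions across all beams
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    return beams
```

In practice, finished beams (those that emitted end-of-sequence) are set aside and length normalization is often applied to the scores; both are omitted here for brevity.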
| Beam width | Compute cost | Quality (MT) | Quality (open-gen) |
|---|---|---|---|
| 1 (greedy) | 1× | Good | Poor |
| 4 | 4× | Best | Worse than sampling |
| 8 | 8× | ~Equal to 4 | Degenerates |
For constrained tasks (translation, summarization) beam search often outperforms sampling. For open-ended generation, sampling dominates.
Temperature Sampling
Scale logits before softmax: p_T(v) = softmax(z / T)
| Temperature T | Distribution | Behavior |
|---|---|---|
| T → 0 | Peaked (argmax) | Greedy; deterministic |
| T = 0.7 | Slightly peaked | Focused; good for factual tasks |
| T = 1.0 | Unchanged | Original model distribution |
| T = 1.5 | Flattened | More random; creative |
| T → ∞ | Uniform | Uniform random selection |
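A minimal sketch of temperature sampling, drawing from the scaled softmax by inverse-CDF (pure Python, no external dependencies):

```python
import math, random

def sample_with_temperature(logits, T, rng=random):
    """Divide logits by T, softmax, then sample from the resulting distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    r, acc = rng.random(), 0.0
    for v, e in enumerate(exps):                 # inverse-CDF sampling
        acc += e / total
        if r < acc:
            return v
    return len(logits) - 1                       # guard against rounding
```

As T shrinks, the scaled distribution concentrates on the argmax token, recovering greedy decoding in the limit.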
Top-k Sampling
Restrict the distribution to the k highest-probability tokens, renormalize, then sample:
p_k(v) ∝ p(v) if v ∈ top-k, else 0
Problem: k=50 may include very low-probability tokens when the distribution is peaked, or exclude many reasonable options when the distribution is flat.
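The filtering step itself is a short pass over the logits; ties at the k-th value are kept, which can slightly enlarge the set:

```python
def top_k_filter(logits, k):
    """Mask everything outside the k largest logits to -inf (softmax then gives them probability 0)."""
    kth = sorted(logits, reverse=True)[k - 1]    # value of the k-th largest logit
    return [z if z >= kth else float("-inf") for z in logits]
```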
Nucleus (Top-p) Sampling
Select the minimal set of tokens covering probability mass ≥ p:
V_p = argmin_{V’ ⊆ V} |V’| s.t. Σ_{v ∈ V’} p(v) ≥ p
Sample from V_p with renormalized probabilities. The set size adapts dynamically to the distribution shape.
| p value | Behavior |
|---|---|
| p = 1.0 | Full distribution (no truncation) |
| p = 0.95 | Slight truncation of low-prob tail |
| p = 0.9 | Standard; Holtzman et al. recommendation |
| p = 0.5 | Aggressive truncation; focused but less diverse |
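A sketch of the nucleus selection, operating on an already-normalized probability vector:

```python
def nucleus_filter(probs, p):
    """Keep the smallest prefix of tokens (sorted by probability) whose mass >= p; renormalize."""
    order = sorted(range(len(probs)), key=lambda v: probs[v], reverse=True)
    keep, mass = [], 0.0
    for v in order:
        keep.append(v)
        mass += probs[v]
        if mass >= p:                  # minimal set covering probability mass p
            break
    out = [0.0] * len(probs)
    for v in keep:
        out[v] = probs[v] / mass       # renormalize over the nucleus
    return out
```

Note how the kept-set size is determined by the distribution itself: a peaked distribution yields a tiny nucleus, a flat one a large nucleus.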
Repetition Penalties
To reduce degenerate repetition, a repetition penalty modifies logits for recently generated tokens:
z’_v = z_v / penalty if z_v > 0, else z_v · penalty, for each token v appearing in the context
With penalty > 1.0, previously generated tokens are de-emphasized; the sign split matters because dividing a negative logit would raise its probability rather than lower it. Typical values: 1.1–1.3.
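A common sign-aware variant divides positive logits and multiplies negative ones, so repeated tokens are always de-emphasized (a sketch, not any particular library's implementation):

```python
def apply_repetition_penalty(logits, context, penalty=1.2):
    """De-emphasize tokens already generated: divide positive logits, multiply negative ones."""
    out = list(logits)
    for v in set(context):
        out[v] = out[v] / penalty if out[v] > 0 else out[v] * penalty
    return out
```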
Related Pages
See kv-cache for how inference is made efficient via caching, speculative-decoding for a technique that accelerates autoregressive generation, and softmax-function for the probability computation at each decoding step.
Sources
- Holtzman et al. (2020) — The Curious Case of Neural Text Degeneration. ICLR 2020
- Sutskever et al. (2014) — Sequence to Sequence Learning with Neural Networks. NeurIPS 2014
- Fan et al. (2018) — Hierarchical Neural Story Generation. ACL 2018
Frequently Asked Questions
Why does beam search produce worse open-ended text than sampling?
Holtzman et al. (2020) showed that beam search maximizes the probability of the entire sequence, but high-probability sequences are not the same as human-like sequences. Human text is not drawn from the maximum-probability mode of the distribution — humans write surprising, diverse, and contextually varied text. Beam search collapses to repetitive high-probability phrases because these dominate the probability mass. Sampling methods that truncate the distribution (top-p, top-k) better match the diversity of natural human text.
What is top-p (nucleus) sampling and how does it differ from top-k?
Top-k sampling selects from the k highest-probability tokens, regardless of how much probability mass they cover. Top-p (nucleus) sampling selects the minimum set of tokens whose cumulative probability ≥ p. At each step, the token set size varies dynamically: if one token has 95% probability, the nucleus has size 1; if 100 tokens each have 1% probability, a nucleus with p = 0.9 has size 90. This dynamic size is the key advantage: top-p adapts to the model's certainty at each position, while top-k uses a fixed cutoff regardless of distribution shape.