Beam Search: Approximate Sequence Optimization in Autoregressive Models
Beam search maintains k hypotheses; at each step it expands k×|V| continuations and retains the top k by cumulative log-probability Σ log P(y_t|y_{<t}). Sutskever et al. (NeurIPS 2014) used k=2–12 for seq2seq MT; Holtzman et al. (ICLR 2020) showed that beam-decoded text is less preferred by human raters than nucleus-sampled text for open-ended generation.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Standard beam width for neural MT | k = 4–5 | beams | Sutskever et al. (2014): k=2 already improves significantly over greedy; k>10 yields diminishing returns |
| Compute cost vs greedy | O(k × \|V\|) per step | — | Each step evaluates k beams × full vocabulary \|V\|; memory scales as k × T × d_model |
| Length penalty (Google NMT) | lp(Y) = (5 + \|Y\|)^α / 6^α, α=0.6–0.7 | — | Wu et al. (2016): without a penalty, beam search favors short sequences; α corrects the length bias |
| BLEU improvement from greedy to beam k=5 | +1.5–2.0 | BLEU points | Typical gain on WMT English–German benchmarks; beam k=5 vs k=1 greedy baseline |
Beam search is an approximate search algorithm for finding high-probability output sequences in autoregressive models. Rather than exhaustively enumerating all |V|^T possible sequences (exponential in length T), it maintains a fixed-size set of k hypotheses at each step and prunes the rest, keeping cost linear in T while exploring far more of the search space than greedy decoding.
Algorithm
At each decoding step t with k active beams:
- For each beam b_i (i=1..k), compute logits over full vocabulary |V|
- Generate k × |V| candidate extensions
- Score all candidates by cumulative log-probability: score(y_{1:t}) = Σ_{j=1}^{t} log P(y_j | y_{<j}, x)
- Retain top-k by score
- Continue until all k beams reach EOS token or max length
Final output: highest-scoring completed beam (optionally after length normalization).
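The loop above can be sketched in a few lines of Python. This is a minimal illustration, not any library's API: the `log_prob_fn(seq)` model interface (returning a dict of per-token log-probabilities given a prefix) and the toy token values are assumptions for demonstration.

```python
import math

def beam_search_step(beams, log_prob_fn, k):
    """One decoding step: expand every beam over the vocabulary,
    then keep the k highest-scoring partial sequences.
    Each beam is a (token_list, cumulative_log_prob) pair."""
    candidates = []
    for seq, score in beams:
        for token, lp in log_prob_fn(seq).items():
            candidates.append((seq + [token], score + lp))
    # retain top-k by cumulative log-probability
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:k]

def beam_search(log_prob_fn, k, max_len, eos):
    """Run beam search until every beam emits `eos` or hits max_len;
    return the highest-scoring hypothesis (no length normalization)."""
    beams = [([], 0.0)]
    finished = []
    for _ in range(max_len):
        beams = beam_search_step(beams, log_prob_fn, k)
        # move beams that just emitted EOS to the finished pool
        finished += [b for b in beams if b[0][-1] == eos]
        beams = [b for b in beams if b[0][-1] != eos]
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])
```

With k=1 this degenerates to greedy decoding; a production decoder would batch the k forward passes and apply length normalization before the final `max`.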
Beam Width vs BLEU Score (WMT English-German)
| Beam Width (k) | BLEU Score | Notes |
|---|---|---|
| 1 (greedy) | ~26.0 | Baseline; misses many good sequences |
| 2 | ~27.2 | Large gain from minimal additional cost |
| 5 | ~28.0–28.4 | Standard; most MT systems use k=4–5 |
| 10 | ~28.3–28.5 | Diminishing returns |
| 50 | ~28.3 | Sometimes degrades due to length bias |
Length Normalization
Without length correction, beam search favors short sequences (fewer terms in the cumulative sum). The Google NMT length penalty (Wu et al., 2016):
lp(Y) = (5 + |Y|)^α / 6^α
with α=0.6–0.7 normalizes scores by sequence length. A coverage penalty (rewarding hypotheses that attend to all source tokens) and n-gram repetition blocking are additional regularizations used in practice.
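The GNMT penalty is straightforward to implement; the function names below are illustrative, not from any library:

```python
def gnmt_length_penalty(length, alpha=0.6):
    """Wu et al. (2016) length penalty: lp(Y) = ((5 + |Y|) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha

def normalized_score(cum_log_prob, length, alpha=0.6):
    """Divide the cumulative log-probability by the penalty so longer
    hypotheses are no longer unfairly penalized during final ranking."""
    return cum_log_prob / gnmt_length_penalty(length, alpha)
```

Note that lp(1) = 1, so very short sequences are unaffected, while the divisor grows sublinearly with |Y|, softening the bias against long outputs.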
When Beam Search Works and When It Fails
| Task | Beam Search | Stochastic Sampling |
|---|---|---|
| Machine translation | Effective (k=4–5) | Worse (adds noise to precise translation) |
| Abstractive summarization | Effective (k=4) | Comparable at high quality |
| Open-ended story generation | Degenerate (repetitive) | Preferred (nucleus/temperature) |
| Dialogue response | Dull, generic | More engaging |
| Structured output (SQL, JSON) | Preferred (deterministic) | Risk of invalid structure |
The Text Degeneration Problem
In open-ended generation, beam search produces repeating n-gram sequences. The failure mechanism: once a high-probability phrase appears, its continuation has high conditional probability, so beam search reinforces the repetition. N-gram blocking (pruning beams whose last n tokens repeat an earlier n-gram, typically n ≥ 3) partially mitigates this, at the cost of a hard constraint that can also suppress legitimate repetition such as recurring entity names.
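A minimal n-gram blocking check can be sketched as follows, assuming beams are plain token lists; during beam expansion, candidates for which this returns True would be pruned:

```python
def violates_ngram_block(seq, n=3):
    """True if the last n tokens of `seq` already appeared earlier
    in `seq`, i.e. keeping this beam would repeat an n-gram."""
    if len(seq) < n + 1:
        return False
    tail = tuple(seq[-n:])
    earlier = {tuple(seq[i:i + n]) for i in range(len(seq) - n)}
    return tail in earlier
```

Inside the decoding loop this becomes a filter on the candidate list, e.g. `candidates = [c for c in candidates if not violates_ngram_block(c[0])]`, applied before the top-k selection.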
Related Pages
See kv-cache for how past attention states are cached during beam search, temperature-sampling for the stochastic alternative, and autoregressive-decoding for the token-by-token generation framework.
Sources
- Sutskever et al. (2014) — Sequence to Sequence Learning with Neural Networks. NeurIPS 2014
- Holtzman et al. (2020) — The Curious Case of Neural Text Degeneration. ICLR 2020
- Wu et al. (2016) — Google's Neural Machine Translation System. arXiv
Frequently Asked Questions
Why does beam search fail for open-ended text generation?
Beam search finds sequences with high joint probability under the model. For machine translation with a clear reference, maximizing probability aligns with quality. For open-ended generation, high-probability sequences are often generic, repetitive, and boring: the model's probability distribution over creative text is broad, and the highest-probability path corresponds to safe, uninteresting continuations. Holtzman et al. (2020) showed that human-written text has substantially lower and more variable per-token probability than beam-search output, so maximizing likelihood systematically diverges from human-like text.
What is the difference between beam search and greedy decoding?
Greedy decoding selects the single highest-probability token at each step — equivalent to beam search with k=1. This can miss sequences where a slightly lower-probability first token leads to a much higher-probability overall sequence. Beam search explores k alternatives simultaneously, pruning the search space to retain the k highest-probability partial sequences at each step. The trade-off: k-beam search requires k× the computation and memory of greedy decoding, with diminishing accuracy returns above k=5–10.