Beam Search: Approximate Sequence Optimization in Autoregressive Models
Beam search maintains k hypotheses; at each step it expands k×|V| continuations and retains the top k by cumulative log-probability Σ log P(y_t|y_{<t}). Sutskever et al. (NeurIPS 2014) used k=2–12 for seq2seq MT; Holtzman et al. (ICLR 2020) showed that beam-decoded text is less preferred by human raters than nucleus-sampled text for open-ended generation.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Standard beam width for neural MT | k = 4–5 | beams | Sutskever et al. (2014): k=2 already improves significantly over greedy; k>10 yields diminishing returns |
| Compute cost vs greedy | O(k × \|V\|) per step | — | Each step evaluates k beams × full vocabulary \|V\|; memory scales as k × T × d_model |
| Length penalty (Google NMT) | lp(Y) = (5 + \|Y\|)^α / 6^α, α=0.6–0.7 | — | Wu et al. (2016): without a penalty, beam search favors short sequences; α corrects the length bias |
| BLEU improvement from greedy to beam k=5 | +1.5–2.0 | BLEU points | Typical gain on WMT English–German benchmarks; beam k=5 vs k=1 greedy baseline |
Beam search is an approximate search algorithm for finding high-probability output sequences in autoregressive models. Rather than exhaustively enumerating all |V|^T possible sequences (exponential in length T), it maintains a fixed-size set of k hypotheses at each step and prunes the rest, keeping cost linear in T while exploring far more of the search space than greedy decoding.
Algorithm
At each decoding step t with k active beams:
- For each beam b_i (i=1..k), compute logits over full vocabulary |V|
- Generate k × |V| candidate extensions
- Score all candidates by cumulative log-probability: score(y_{1:t}) = Σ_{j=1}^{t} log P(y_j | y_{<j}, x)
- Retain top-k by score
- Continue until all k beams reach EOS token or max length
Final output: highest-scoring completed beam (optionally after length normalization).
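The loop above can be sketched in a few lines of Python. This is a minimal illustration, not any library's API: the `log_prob_fn(seq)` model interface (returning a dict of per-token log-probabilities given a prefix) and the toy token values are assumptions for demonstration.

```python
import math

def beam_search_step(beams, log_prob_fn, k):
    """One decoding step: expand every beam over the vocabulary,
    then keep the k highest-scoring partial sequences.
    Each beam is a (token_list, cumulative_log_prob) pair."""
    candidates = []
    for seq, score in beams:
        for token, lp in log_prob_fn(seq).items():
            candidates.append((seq + [token], score + lp))
    # retain top-k by cumulative log-probability
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:k]

def beam_search(log_prob_fn, k, max_len, eos):
    """Run beam search until every beam emits `eos` or hits max_len;
    return the highest-scoring hypothesis (no length normalization)."""
    beams = [([], 0.0)]
    finished = []
    for _ in range(max_len):
        beams = beam_search_step(beams, log_prob_fn, k)
        # move beams that just emitted EOS to the finished pool
        finished += [b for b in beams if b[0][-1] == eos]
        beams = [b for b in beams if b[0][-1] != eos]
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])
```

With k=1 this degenerates to greedy decoding; a production decoder would batch the k forward passes and apply length normalization before the final `max`.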
Beam Width vs BLEU Score (WMT English-German)
| Beam Width (k) | BLEU Score | Notes |
|---|---|---|
| 1 (greedy) | ~26.0 | Baseline; misses many good sequences |
| 2 | ~27.2 | Large gain from minimal additional cost |
| 5 | ~28.0–28.4 | Standard; most MT systems use k=4–5 |
| 10 | ~28.3–28.5 | Diminishing returns |
| 50 | ~28.3 | Sometimes degrades due to length bias |
Length Normalization
Without length correction, beam search favors short sequences (fewer terms in the cumulative sum). The Google NMT length penalty (Wu et al., 2016):
lp(Y) = (5 + |Y|)^α / 6^α
with α=0.6–0.7 normalizes scores by sequence length. A coverage penalty (rewarding hypotheses that attend to all source tokens) and n-gram repetition blocking are additional regularizations used in practice.
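The GNMT penalty is straightforward to implement; the function names below are illustrative, not from any library:

```python
def gnmt_length_penalty(length, alpha=0.6):
    """Wu et al. (2016) length penalty: lp(Y) = ((5 + |Y|) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha

def normalized_score(cum_log_prob, length, alpha=0.6):
    """Divide the cumulative log-probability by the penalty so longer
    hypotheses are no longer unfairly penalized during final ranking."""
    return cum_log_prob / gnmt_length_penalty(length, alpha)
```

Note that lp(1) = 1, so very short sequences are unaffected, while the divisor grows sublinearly with |Y|, softening the bias against long outputs.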
When Beam Search Works and When It Fails
| Task | Beam Search | Stochastic Sampling |
|---|---|---|
| Machine translation | Effective (k=4–5) | Worse (adds noise to precise translation) |
| Abstractive summarization | Effective (k=4) | Comparable at high quality |
| Open-ended story generation | Degenerate (repetitive) | Preferred (nucleus/temperature) |
| Dialogue response | Dull, generic | More engaging |
| Structured output (SQL, JSON) | Preferred (deterministic) | Risk of invalid structure |
The Text Degeneration Problem
In open-ended generation, beam search produces repeating n-gram sequences. The failure mechanism: once a high-probability phrase appears, its continuation has high conditional probability, so beam search reinforces the repetition. N-gram blocking (pruning beams whose last n tokens repeat an earlier n-gram, typically n ≥ 3) partially mitigates this, at the cost of a hard constraint that can also suppress legitimate repetition such as recurring entity names.
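A minimal n-gram blocking check can be sketched as follows, assuming beams are plain token lists; during beam expansion, candidates for which this returns True would be pruned:

```python
def violates_ngram_block(seq, n=3):
    """True if the last n tokens of `seq` already appeared earlier
    in `seq`, i.e. keeping this beam would repeat an n-gram."""
    if len(seq) < n + 1:
        return False
    tail = tuple(seq[-n:])
    earlier = {tuple(seq[i:i + n]) for i in range(len(seq) - n)}
    return tail in earlier
```

Inside the decoding loop this becomes a filter on the candidate list, e.g. `candidates = [c for c in candidates if not violates_ngram_block(c[0])]`, applied before the top-k selection.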
Related Pages
See kv-cache for how past attention states are cached during beam search, temperature-sampling for the stochastic alternative, and autoregressive-decoding for the token-by-token generation framework.
Sources
- Sutskever et al. (2014) — Sequence to Sequence Learning with Neural Networks. NeurIPS 2014
- Holtzman et al. (2020) — The Curious Case of Neural Text Degeneration. ICLR 2020
- Wu et al. (2016) — Google's Neural Machine Translation System. arXiv
Frequently Asked Questions
Why does beam search fail for open-ended text generation?
Beam search finds sequences with high joint probability under the model. For machine translation with a clear reference, maximizing probability aligns with quality. For open-ended generation, high-probability sequences are often generic, repetitive, and boring: the model's probability distribution over creative text is broad, and the highest-probability path corresponds to safe, uninteresting continuations. Holtzman et al. (2020) showed that human-written text has substantially lower and more variable per-token probability than beam-search output, so maximizing likelihood systematically diverges from human-like text.
What is the difference between beam search and greedy decoding?
Greedy decoding selects the single highest-probability token at each step — equivalent to beam search with k=1. This can miss sequences where a slightly lower-probability first token leads to a much higher-probability overall sequence. Beam search explores k alternatives simultaneously, pruning the search space to retain the k highest-probability partial sequences at each step. The trade-off: k-beam search requires k× the computation and memory of greedy decoding, with diminishing accuracy returns above k=5–10.