Byte-Pair Encoding: Algorithm, Vocabulary Construction, and Byte-Level BPE

Category: representation · Updated: 2026-02-27

BPE iteratively merges the most frequent adjacent symbol pair until the vocabulary reaches a target size; GPT-2 applies BPE to raw UTF-8 bytes, producing a 50,257-token vocabulary with guaranteed full Unicode coverage and zero unknown tokens (Radford et al., 2019).

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| GPT-2 vocabulary size | 50,257 | tokens | Byte-level BPE; all 256 bytes as base tokens + 50,000 merges + 1 special token |
| Base vocabulary (byte-level) | 256 | bytes | All UTF-8 byte values as initial symbols; no unknown-token issue |
| BPE merge operations (GPT-2) | 50,000 | merges | Each merge combines two symbols into one new subword token |
| BPE training time complexity | O(V·I) | — | V = vocabulary size, I = number of iterations; each iteration scans the corpus once |
| Typical multilingual vocabulary | 250,000+ | tokens | Larger vocabularies needed to cover diverse scripts without excessive character-level decomposition |

Byte-pair encoding, originally a data compression algorithm, was adapted for neural machine translation by Sennrich et al. (2016) to address the open-vocabulary problem. It builds a vocabulary of subword units by greedily merging the most frequent adjacent symbol pairs from a training corpus.

The Algorithm

Initialization: Build a corpus as a frequency-weighted list of words, each split into characters with an end-of-word marker (e.g., "low" → "l o w </w>").

Iteration:

  1. Count frequency of all adjacent symbol pairs in the corpus
  2. Find the most frequent pair (e.g., “e s” appearing 8,000 times)
  3. Merge that pair into a new symbol (“es”)
  4. Update the corpus vocabulary with the merged symbol
  5. Repeat until vocabulary reaches target size

Result: A vocabulary where common words appear as single tokens, and rare words decompose into several subword pieces.
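The iteration loop above can be sketched in a few lines of Python. This is a minimal, unoptimized version; the `</w>` end-of-word marker and the function names are illustrative choices, and production implementations use incremental pair-count updates rather than a full rescan per merge:

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = []
    for symbols, freq in corpus:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append((out, freq))
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge rules from a {word: count} dict."""
    corpus = [(list(w) + ["</w>"], f) for w, f in word_freqs.items()]
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = merge_pair(corpus, best)
    return merges
```

On a toy corpus like `{"low": 5, "lower": 2, "lowest": 8, "newest": 6}`, the first learned merge is `("w", "e")`, the pair with the highest weighted count.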

Merge Priority Example

| Step | Most Frequent Pair | Frequency | New Symbol | Corpus Change |
|---|---|---|---|---|
| 1 | ("e", "s") | 8,432 | "es" | "lowest" → "l o w es t" |
| 2 | ("es", "t") | 6,891 | "est" | "lowest" → "l o w est" |
| 3 | ("l", "o") | 5,234 | "lo" | "lowest" → "lo w est" |
| 4 | ("lo", "w") | 4,102 | "low" | "lowest" → "low est" |

After sufficient merges, common words like “the” and “lowest” are single tokens, while rare words like “electroencephalography” decompose into frequent morphemes.
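Encoding a new word replays the learned merges in priority order (earliest-learned first). A minimal sketch, using the merge rules and `</w>` marker from the table above; real tokenizers use faster data structures but implement the same rule-replay logic:

```python
def bpe_encode(word, merges):
    """Segment a word by applying each merge rule in learned priority order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:  # earlier merges were more frequent, so apply them first
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
print(bpe_encode("lowest", merges))  # ['low', 'est', '</w>']
```

A word the merges never fully cover simply falls back to smaller pieces: with these four rules, "slow" segments as `['s', 'low', '</w>']`.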

Byte-Level BPE vs Character-Level BPE

| Property | Character-level BPE | Byte-level BPE |
|---|---|---|
| Base vocabulary | Unicode characters (1,114,112 possible code points) | 256 UTF-8 byte values |
| Unknown tokens | Possible for unusual Unicode | None — every string is encodable |
| Sequence length | Baseline | Slightly longer for non-ASCII text |
| Implementation | Simpler | Requires a byte-to-character mapping |
| Example (GPT-2) | — | 256 bytes + 50,000 merges + 1 special token = 50,257 tokens |
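The "every string is encodable" property is straightforward to verify: any string reduces to UTF-8 byte values in 0–255, so the 256 base tokens always suffice, at the cost of more tokens per non-ASCII character. A quick check (the sample text is arbitrary):

```python
# Byte-level base vocabulary: every string reduces to byte values 0..255,
# so no input is ever out of vocabulary.
text = "naïve 日本語"
byte_ids = list(text.encode("utf-8"))
assert all(0 <= b < 256 for b in byte_ids)

# Non-ASCII characters cost more than one base token:
# "ï" is 2 bytes; each CJK character here is 3 bytes.
assert len(byte_ids) > len(text)  # 16 byte tokens for 9 characters

# Decoding the byte sequence losslessly recovers the original string.
assert bytes(byte_ids).decode("utf-8") == text
```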

Vocabulary Size Tradeoffs

| Vocabulary Size | Pros | Cons |
|---|---|---|
| Small (16K) | Small embedding table | Longer sequences; more character-level splits for rare words |
| Medium (32K–50K) | Good balance for monolingual English | Moderate sequence length |
| Large (100K+) | Better coverage of morphology and code | Large embedding/output matrices |
| Multilingual (250K+) | All scripts covered efficiently | Very large embedding tables |
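The embedding-table cost in the table above is simple arithmetic: parameters = vocab_size × d_model (doubled if the output projection is untied). A quick illustration; d_model = 4096 is an arbitrary hidden size chosen for the example, not tied to any particular model:

```python
# Embedding-table parameter count as a function of vocabulary size.
d_model = 4096  # illustrative hidden size (assumption, not a specific model)
for vocab in (16_000, 50_257, 250_000):
    params = vocab * d_model
    print(f"{vocab:>7} tokens -> {params / 1e6:.0f}M embedding parameters")
```

At 250K tokens the embedding table alone crosses a billion parameters, which is why multilingual vocabularies dominate the memory budget of smaller models.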

See tokenization for the broader tokenization landscape, and word-embeddings for how these token indices are converted to dense continuous vectors for the transformer.



Frequently Asked Questions

What problem does BPE solve compared to word-level tokenization?

Word-level tokenization must map any word unseen during training to an [UNK] token — a critical problem for names, technical terms, and morphological variants. BPE avoids this by decomposing rare words into frequent subword pieces: 'internationalization' may be absent from the vocabulary, but 'inter', 'national', and 'ization' are present. Sennrich et al. (2016) showed BPE-based NMT systems achieve near-zero unknown-token rates while keeping sequence length manageable.

What is the difference between character-level BPE and byte-level BPE?

Character-level BPE starts with Unicode code points as the initial vocabulary, which can number in the thousands for multilingual text. Byte-level BPE (introduced by Radford et al., 2019 for GPT-2) starts with raw UTF-8 bytes as the 256-token base vocabulary. Since all text is ultimately bytes, byte-level BPE guarantees zero unknown tokens regardless of language or script, at the cost of slightly longer sequences for some Unicode characters.
