Tokenization: Subword Units, Vocabulary Size, and Characters Per Token

Category: representation Updated: 2026-02-27

Subword tokenization with BPE produces vocabularies of 32K–100K units; GPT-2's 50,257-token vocabulary averages ~4 characters per English token; a 1,000-word paragraph encodes to approximately 1,300 tokens.

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| GPT-2 vocabulary size | 50,257 | tokens | Byte-level BPE: 256 base byte tokens + 50,000 learned merges + 1 end-of-text token |
| Average characters per English token | ~4 | chars/token | Common words are single tokens; rare words split into multiple subword pieces |
| Typical sequence expansion | 1.3× | tokens/word | 1,000 English words ≈ 1,300 tokens; depends on vocabulary and text domain |
| BPE vocabulary range | 32K–100K | tokens | Smaller vocab → longer sequences; larger vocab → more parameters in embedding table |
| Embedding table size (50K vocab, 512 dim) | 50,257 × 512 ≈ 25.7M | parameters | Embedding matrix is often the largest single parameter block in small models |
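The embedding-table row is simple arithmetic; here is a quick sketch (pure parameter counting, not tied to any framework):

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in a vocab_size × d_model embedding matrix."""
    return vocab_size * d_model

print(embedding_params(50_257, 512))    # 25,731,584 ≈ 25.7M
print(embedding_params(50_257, 1_024))  # 51,463,168 ≈ 51.5M (the d_model=1024 case from the FAQ)
```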

Tokenization converts raw text into a sequence of integer indices that can be fed to a language model. The choice of tokenizer determines the vocabulary size, the average sequence length, and how the model handles rare or out-of-vocabulary words.

Tokenization Strategies

| Strategy | Unit | Vocab Size | Seq Length | OOV Handling |
|---|---|---|---|---|
| Word-level | Whole words | 50K–500K | 1.0× (words) | [UNK] token |
| Character-level | Individual chars | 26–300 | 5–6× longer | None (all chars known) |
| Byte-level | UTF-8 bytes | 256 | 4–5× longer | None (all bytes covered) |
| BPE (subword) | Learned merges | 32K–100K | 1.3× words | None (falls back to bytes/chars) |
| WordPiece | Likelihood-based merges | ~30K | ~1.3× | ## prefix for continuations |
| SentencePiece | BPE or unigram | Configurable | ~1.3× | Full Unicode coverage |
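To make the sequence-length column concrete, here is a short Python comparison of the non-learned strategies on one string (word-level, character-level, and byte-level; the string is chosen so that character and byte counts diverge):

```python
text = "naïve tokenizer"

word_tokens = text.split()                # whitespace words; unseen words would become [UNK]
char_tokens = list(text)                  # one token per Unicode character
byte_tokens = list(text.encode("utf-8"))  # one token per UTF-8 byte; 'ï' costs 2 bytes

print(len(word_tokens), len(char_tokens), len(byte_tokens))   # 2 15 16
```

The byte-level count exceeds the character count as soon as the text leaves ASCII, which is why byte-level vocabularies of only 256 entries can still cover all input.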

BPE Vocabulary Construction

Byte-pair encoding begins with individual characters (or bytes) as the initial vocabulary and iteratively merges the most frequent adjacent pair:

  1. Start: vocabulary = set of all characters in corpus
  2. Count all adjacent symbol pairs
  3. Merge most frequent pair into a new symbol
  4. Repeat until vocabulary reaches target size

With 50K merges on English text, the resulting vocabulary contains single tokens for common words (“the”, “and”, “is”) and frequent morphemes (“-tion”, “-ing”), while rare technical terms split into several subword tokens.
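The merge loop above can be sketched in a few lines of Python (a toy, Sennrich-style implementation over a word-frequency table, not a production tokenizer):

```python
from collections import Counter

def byte_pair_merges(corpus, num_merges):
    """Learn `num_merges` BPE merges from a list of words (toy version)."""
    # Step 1: start from individual characters; keep a word-frequency table.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Step 2: count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Step 3: merge the most frequent pair into a single new symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
        # Step 4: the loop repeats until the target number of merges is reached.
    return merges

corpus = ["low", "low", "lower", "newest", "newest", "newest", "widest"]
print(byte_pair_merges(corpus, 3))
```

On this tiny corpus the first merges pick up frequent pairs like ("w", "e") and ("s", "t"); with a real corpus and tens of thousands of merges, whole common words emerge as single symbols.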

Token Count Examples

| Text | Characters | Approximate Tokens | Ratio |
|---|---|---|---|
| "Hello world" | 11 | 2 ("Hello", " world") | 5.5 chars/token |
| "the quick brown fox" | 19 | 5 | 3.8 chars/token |
| "internationalization" | 20 | 6 (subword splits) | 3.3 chars/token |
| "Supercalifragilistic" | 20 | 8 (rare word) | 2.5 chars/token |
| Average English text | — | — | ~4 chars/token |
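To see how a rare word like "internationalization" actually splits, here is a toy greedy longest-prefix segmenter. This is WordPiece-style matching rather than BPE's merge replay, and the vocabulary is invented for illustration:

```python
def greedy_tokenize(text, vocab):
    """Segment `text` by repeatedly taking the longest prefix found in `vocab`,
    falling back to single characters when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

vocab = {"inter", "national", "ization", "the", "ing"}
print(greedy_tokenize("internationalization", vocab))
# ['inter', 'national', 'ization']
```

A trained tokenizer's actual splits depend on its learned vocabulary, which is why the token counts in the table above are approximate.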

Impact on Context Window

Tokenization efficiency directly determines effective context: a 4,096-token window covers roughly 4,096 × 4 ≈ 16,384 characters, or about 3,000 words of English prose. Code tokenizes less efficiently (identifiers and punctuation fragment into many short tokens), so the same window holds fewer logical units.
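The same arithmetic as a sketch, using the average figures quoted in this article (~4 chars/token, ~1.3 tokens/word); real ratios vary by tokenizer and text domain:

```python
def window_capacity(context_tokens, chars_per_token=4.0, tokens_per_word=1.3):
    """Rough characters and words that fit in a context window."""
    return int(context_tokens * chars_per_token), int(context_tokens / tokens_per_word)

chars, words = window_capacity(4096)
print(chars, words)   # 16384 3150
```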

See byte-pair-encoding for the algorithm in detail, context-window for how sequence length constrains model architecture, and word-embeddings for how token indices are mapped to dense vectors.


Frequently Asked Questions

Why use subword tokenization instead of word-level or character-level?

Word-level tokenization creates enormous vocabularies (500K+ words with proper nouns, inflections, misspellings) and treats unseen words as [UNK] tokens. Character-level tokenization produces very long sequences and struggles to capture semantic units. Subword tokenization (BPE, WordPiece, SentencePiece) strikes a middle ground: frequent words become single tokens, while rare words split into meaningful subword pieces. Sennrich et al. (2016) showed BPE-based systems achieve near-zero out-of-vocabulary rates while keeping sequence lengths manageable.

How does vocabulary size affect model behavior?

Larger vocabularies reduce sequence length (fewer tokens per character), which reduces attention computation (quadratic in length). But larger vocabularies mean larger embedding tables and output projection layers. At 50K tokens and d_model=1024, the embedding/unembedding matrices alone contain 50K × 1024 ≈ 51M parameters. There is an empirical sweet spot around 32K–100K tokens for English text; multilingual models often use 250K+ to cover diverse scripts.
