Tokenization: Subword Units, Vocabulary Size, and Characters Per Token
Subword tokenization with BPE produces vocabularies of 32K–100K units; GPT-2's 50,257-token vocabulary averages ~4 characters per English token; a 1,000-word paragraph encodes to approximately 1,300 tokens.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| GPT-2 vocabulary size | 50,257 | tokens | Byte-level BPE; 256 base byte tokens + 50,000 learned merges + 1 end-of-text token |
| Average characters per English token | ~4 | chars/token | Common words are single tokens; rare words split into multiple subword pieces |
| Typical sequence expansion | 1.3× | relative | 1,000 English words ≈ 1,300 tokens; depends on vocabulary and text domain |
| BPE vocabulary range | 32K–100K | tokens | Smaller vocab → longer sequences; larger vocab → more parameters in embedding table |
| Embedding table size (50K vocab, 512 dim) | 50,257 × 512 = 25.7M | parameters | Embedding matrix often largest single parameter block in small models |
Tokenization is the process of converting raw text into a sequence of integer indices before it is fed to a language model. The choice of tokenizer determines the vocabulary size, the average sequence length, and how the model handles rare or out-of-vocabulary words.
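As a minimal sketch of this mapping, the toy word-level tokenizer below converts text to integer indices, with an [UNK] fallback for unseen words (the vocabulary and `encode` function are illustrative, not any real tokenizer's API):

```python
# Toy word-level vocabulary: maps each known word to an integer index.
vocab = {"the": 0, "quick": 1, "brown": 2, "fox": 3, "[UNK]": 4}

def encode(text):
    """Look up each whitespace-split word, falling back to [UNK]."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

encode("The quick brown fox")  # → [0, 1, 2, 3]
encode("the purple fox")       # → [0, 4, 3]  ("purple" is out-of-vocabulary)
```

Subword tokenizers follow the same text-to-indices contract but avoid the [UNK] fallback by splitting unknown words into known pieces, as the strategy table below shows.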
Tokenization Strategies
| Strategy | Unit | Vocab Size | Seq Length | OOV Handling |
|---|---|---|---|---|
| Word-level | Whole words | 50K–500K | 1.0× (words) | [UNK] token |
| Character-level | Individual chars | 26–300 | 5–6× longer | None (all chars known) |
| Byte-level | UTF-8 bytes | 256 | 4–5× longer | None (all bytes covered) |
| BPE (subword) | Learned merges | 32K–100K | 1.3× words | None (falls back to bytes/chars) |
| WordPiece | Likelihood merges | 30K | 1.3× | ## prefix for continuations |
| SentencePiece | BPE or unigram | Configurable | 1.3× | Full Unicode coverage |
BPE Vocabulary Construction
Byte-pair encoding begins with individual characters (or bytes) as the initial vocabulary and iteratively merges the most frequent adjacent pair:
- Start: vocabulary = set of all characters in corpus
- Count all adjacent symbol pairs
- Merge most frequent pair into a new symbol
- Repeat until vocabulary reaches target size
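The merge loop above can be sketched in pure Python. This is a didactic implementation of the Sennrich et al. (2016) algorithm on a word-frequency corpus, not a production tokenizer (the corpus and function name are illustrative):

```python
from collections import Counter

def learn_bpe(corpus_words, num_merges):
    """Learn BPE merge rules from a {word: frequency} corpus.

    Each word starts as a tuple of characters; each iteration merges
    the most frequent adjacent symbol pair into a single new symbol.
    Returns the ordered list of merge rules.
    """
    vocab = {tuple(word): freq for word, freq in corpus_words.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair everywhere it occurs.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

# Tiny toy corpus: "we" is the most frequent pair, so it merges first.
merges = learn_bpe({"low": 5, "lower": 2, "lowest": 3, "newest": 6}, num_merges=4)
# merges[0] == ('w', 'e'); merges[1] == ('l', 'o')
```

A real implementation adds word-boundary markers and applies the learned merge list, in order, to tokenize new text.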
With 50K merges on English text, the resulting vocabulary contains single tokens for common words (“the”, “and”, “is”), compound tokens for frequent morphemes (“-tion”, “-ing”), and multi-merge tokens for technical terms.
Token Count Examples
| Text | Characters | Approximate Tokens | Ratio |
|---|---|---|---|
| “Hello world” | 11 | 2 (“Hello”, “ world”) | 5.5 chars/token |
| “the quick brown fox” | 19 | 5 | 3.8 chars/token |
| “internationalization” | 20 | 6 (subword splits) | 3.3 chars/token |
| “Supercalifragilistic” | 20 | 8 (rare word) | 2.5 chars/token |
| Average English text | — | — | ~4 chars/token |
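The ~4 chars/token average gives a quick back-of-envelope token estimate without running a tokenizer. The helper below is a hypothetical sketch of that heuristic, not a real tokenizer call; actual counts vary by vocabulary and domain, as the table shows:

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token-count estimate for English text under a
    GPT-2-style BPE vocabulary, using the ~4 chars/token heuristic."""
    return max(1, round(len(text) / chars_per_token))

estimate_tokens("the quick brown fox")  # 19 chars → ≈5 tokens, matching the table
```

Rare or technical words (lower chars/token) make this an underestimate, so it is best used for capacity planning rather than exact accounting.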
Impact on Context Window
Tokenization efficiency directly determines effective context: a 4,096-token context window covers roughly 4,096 × 4 ≈ 16,384 characters, or about 3,000 English words. Code typically uses more tokens per character (identifiers, whitespace, and punctuation split into many pieces), so the same window holds fewer logical units.
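The window arithmetic above can be packaged as a small budget calculator. This is an illustrative sketch under the stated assumptions (~4 chars/token, and ~5.5 characters per English word including the trailing space):

```python
def context_budget(context_tokens, chars_per_token=4.0, chars_per_word=5.5):
    """Estimate the characters and English words that fit in a context
    window, given average chars/token and chars/word (space included)."""
    chars = context_tokens * chars_per_token
    words = chars / chars_per_word
    return int(chars), int(words)

context_budget(4096)  # → (16384, 2978): roughly 16K characters, ~3,000 words
```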
Related Pages
See byte-pair-encoding for the algorithm in detail, context-window for how sequence length constrains model architecture, and word-embeddings for how token indices are mapped to dense vectors.
Sources
- Sennrich et al. (2016) — Neural Machine Translation of Rare Words with Subword Units. ACL 2016
- Radford et al. (2019) — Language Models are Unsupervised Multitask Learners (GPT-2). OpenAI Technical Report
- Kudo & Richardson (2018) — SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. EMNLP 2018
Frequently Asked Questions
Why use subword tokenization instead of word-level or character-level?
Word-level tokenization creates enormous vocabularies (500K+ words once proper nouns, inflections, and misspellings are counted) and treats unseen words as [UNK] tokens. Character-level tokenization produces very long sequences and struggles to capture semantic units. Subword tokenization (BPE, WordPiece, SentencePiece) finds a middle ground: frequent words become single tokens, and rare words split into meaningful subword pieces. Sennrich et al. (2016) showed BPE-based systems achieve near-zero out-of-vocabulary rates while keeping sequences manageable.
How does vocabulary size affect model behavior?
Larger vocabularies reduce sequence length (fewer tokens per character), which reduces attention computation (quadratic in length). But larger vocabularies mean larger embedding tables and output projection layers. At 50K tokens and d_model=1024, the embedding/unembedding matrices alone contain 50K × 1024 ≈ 51M parameters. There is an empirical sweet spot around 32K–100K tokens for English text; multilingual models often use 250K+ to cover diverse scripts.
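The embedding-size trade-off above is simple arithmetic. The sketch below (hypothetical helper name; `tied` refers to weight tying between the embedding and output-projection matrices) reproduces the figures quoted in this article:

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameter count for the embedding matrix, doubled when the
    output (unembedding) projection is not weight-tied to it."""
    n = vocab_size * d_model
    return n if tied else 2 * n

embedding_params(50_257, 512)    # → 25,731,584 ≈ 25.7M (table above)
embedding_params(50_000, 1024)   # → 51,200,000 ≈ 51M   (FAQ figure)
```

Doubling the vocabulary doubles this count, which is why the 32K–100K sweet spot balances sequence length against embedding-table size.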