Tokenization: Subword Units, Vocabulary Size, and Characters Per Token
Subword tokenization with BPE produces vocabularies of 32K–100K units; GPT-2's 50,257-token vocabulary averages ~4 characters per English token; a 1,000-word paragraph encodes to approximately 1,300 tokens.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| GPT-2 vocabulary size | 50,257 | tokens | Byte-level BPE; 256 base byte tokens + 50,000 learned merges + 1 end-of-text token |
| Average characters per English token | ~4 | chars/token | Common words are single tokens; rare words split into multiple subword pieces |
| Typical sequence expansion | 1.3× | relative | 1,000 English words ≈ 1,300 tokens; depends on vocabulary and text domain |
| BPE vocabulary range | 32K–100K | tokens | Smaller vocab → longer sequences; larger vocab → more parameters in embedding table |
| Embedding table size (50K vocab, 512 dim) | 50,257 × 512 = 25.7M | parameters | Embedding matrix often largest single parameter block in small models |
Tokenization is the process of converting raw text into a sequence of integer indices before it is fed to a language model. The choice of tokenizer determines the vocabulary size, the average sequence length, and how the model handles rare or out-of-vocabulary words.
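As a minimal sketch of this mapping, the toy word-level tokenizer below converts text to integer indices, with an [UNK] fallback for unseen words (the vocabulary and `encode` function are illustrative, not any real tokenizer's API):

```python
# Toy word-level vocabulary: maps each known word to an integer index.
vocab = {"the": 0, "quick": 1, "brown": 2, "fox": 3, "[UNK]": 4}

def encode(text):
    """Look up each whitespace-split word, falling back to [UNK]."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

encode("The quick brown fox")  # → [0, 1, 2, 3]
encode("the purple fox")       # → [0, 4, 3]  ("purple" is out-of-vocabulary)
```

Subword tokenizers follow the same text-to-indices contract but avoid the [UNK] fallback by splitting unknown words into known pieces, as the strategy table below shows.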
Tokenization Strategies
| Strategy | Unit | Vocab Size | Seq Length | OOV Handling |
|---|---|---|---|---|
| Word-level | Whole words | 50K–500K | 1.0× (words) | [UNK] token |
| Character-level | Individual chars | 26–300 | 5–6× longer | None (all chars known) |
| Byte-level | UTF-8 bytes | 256 | 4–5× longer | None (all bytes covered) |
| BPE (subword) | Learned merges | 32K–100K | 1.3× words | None (falls back to bytes/chars) |
| WordPiece | Likelihood merges | 30K | 1.3× | ## prefix for continuations |
| SentencePiece | BPE or unigram | Configurable | 1.3× | Full Unicode coverage |
BPE Vocabulary Construction
Byte-pair encoding begins with individual characters (or bytes) as the initial vocabulary and iteratively merges the most frequent adjacent pair:
- Start: vocabulary = set of all characters in corpus
- Count all adjacent symbol pairs
- Merge most frequent pair into a new symbol
- Repeat until vocabulary reaches target size
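The merge loop above can be sketched in pure Python. This is a didactic implementation of the Sennrich et al. (2016) algorithm on a word-frequency corpus, not a production tokenizer (the corpus and function name are illustrative):

```python
from collections import Counter

def learn_bpe(corpus_words, num_merges):
    """Learn BPE merge rules from a {word: frequency} corpus.

    Each word starts as a tuple of characters; each iteration merges
    the most frequent adjacent symbol pair into a single new symbol.
    Returns the ordered list of merge rules.
    """
    vocab = {tuple(word): freq for word, freq in corpus_words.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair everywhere it occurs.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

# Tiny toy corpus: "we" is the most frequent pair, so it merges first.
merges = learn_bpe({"low": 5, "lower": 2, "lowest": 3, "newest": 6}, num_merges=4)
# merges[0] == ('w', 'e'); merges[1] == ('l', 'o')
```

A real implementation adds word-boundary markers and applies the learned merge list, in order, to tokenize new text.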
With 50K merges on English text, the resulting vocabulary contains single tokens for common words (“the”, “and”, “is”), compound tokens for frequent morphemes (“-tion”, “-ing”), and multi-merge tokens for technical terms.
Token Count Examples
| Text | Characters | Approximate Tokens | Ratio |
|---|---|---|---|
| “Hello world” | 11 | 2 (“Hello”, “ world”) | 5.5 chars/token |
| “the quick brown fox” | 19 | 5 | 3.8 chars/token |
| “internationalization” | 20 | 6 (subword splits) | 3.3 chars/token |
| “Supercalifragilistic” | 20 | 8 (rare word) | 2.5 chars/token |
| Average English text | — | — | ~4 chars/token |
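The ~4 chars/token average gives a quick back-of-envelope token estimate without running a tokenizer. The helper below is a hypothetical sketch of that heuristic, not a real tokenizer call; actual counts vary by vocabulary and domain, as the table shows:

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token-count estimate for English text under a
    GPT-2-style BPE vocabulary, using the ~4 chars/token heuristic."""
    return max(1, round(len(text) / chars_per_token))

estimate_tokens("the quick brown fox")  # 19 chars → ≈5 tokens, matching the table
```

Rare or technical words (lower chars/token) make this an underestimate, so it is best used for capacity planning rather than exact accounting.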
Impact on Context Window
Tokenization efficiency directly determines effective context: a 4,096-token context window covers roughly 4,096 × 4 ≈ 16,384 characters, or about 3,000 English words. Code typically uses more tokens per character (identifiers, whitespace, and punctuation split into many pieces), so the same window holds fewer logical units.
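The window arithmetic above can be packaged as a small budget calculator. This is an illustrative sketch under the stated assumptions (~4 chars/token, and ~5.5 characters per English word including the trailing space):

```python
def context_budget(context_tokens, chars_per_token=4.0, chars_per_word=5.5):
    """Estimate the characters and English words that fit in a context
    window, given average chars/token and chars/word (space included)."""
    chars = context_tokens * chars_per_token
    words = chars / chars_per_word
    return int(chars), int(words)

context_budget(4096)  # → (16384, 2978): roughly 16K characters, ~3,000 words
```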
Related Pages
See byte-pair-encoding for the algorithm in detail, context-window for how sequence length constrains model architecture, and word-embeddings for how token indices are mapped to dense vectors.
Sources
- Sennrich et al. (2016) — Neural Machine Translation of Rare Words with Subword Units. ACL 2016
- Radford et al. (2019) — Language Models are Unsupervised Multitask Learners (GPT-2). OpenAI Technical Report
- Kudo & Richardson (2018) — SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. EMNLP 2018
Frequently Asked Questions
Why use subword tokenization instead of word-level or character-level?
Word-level tokenization creates enormous vocabularies (500K+ words once proper nouns, inflections, and misspellings are counted) and treats unseen words as [UNK] tokens. Character-level tokenization produces very long sequences and struggles to capture semantic units. Subword tokenization (BPE, WordPiece, SentencePiece) finds a middle ground: frequent words become single tokens, and rare words split into meaningful subword pieces. Sennrich et al. (2016) showed BPE-based systems achieve near-zero out-of-vocabulary rates while keeping sequences manageable.
How does vocabulary size affect model behavior?
Larger vocabularies reduce sequence length (fewer tokens per character), which reduces attention computation (quadratic in length). But larger vocabularies mean larger embedding tables and output projection layers. At 50K tokens and d_model=1024, the embedding/unembedding matrices alone contain 50K × 1024 ≈ 51M parameters. There is an empirical sweet spot around 32K–100K tokens for English text; multilingual models often use 250K+ to cover diverse scripts.
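The embedding-size trade-off above is simple arithmetic. The sketch below (hypothetical helper name; `tied` refers to weight tying between the embedding and output-projection matrices) reproduces the figures quoted in this article:

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameter count for the embedding matrix, doubled when the
    output (unembedding) projection is not weight-tied to it."""
    n = vocab_size * d_model
    return n if tied else 2 * n

embedding_params(50_257, 512)    # → 25,731,584 ≈ 25.7M (table above)
embedding_params(50_000, 1024)   # → 51,200,000 ≈ 51M   (FAQ figure)
```

Doubling the vocabulary doubles this count, which is why the 32K–100K sweet spot balances sequence length against embedding-table size.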