Context Window: Sequence Length Limits, Positional Methods, and Long-Context Extensions
Original transformer context window: 512 tokens; self-attention memory scales as O(n²), making long contexts expensive; RoPE enables extrapolation beyond training length; some architectures reach 1M-token context windows.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Original transformer (2017) | 512 | tokens | Limited by O(n²) attention memory at training time |
| Attention memory scaling | O(n²) | — | 4× longer sequence = 16× more attention memory; dominant cost at long context |
| KV cache size (n_layers=32, n_heads=32, d_head=128, n=4096, fp16) | ≈ 2.1 | GB | 2 × 32 × 32 × 128 × 4096 × 2 bytes ≈ 2.1 GB; KV cache dominates memory at long contexts and grows linearly with sequence length |
| RoPE extrapolation | ~8× | training length | Su et al. (2024): RoPE reported to maintain performance well beyond training context length |
| Sparse attention memory reduction | O(n·√n) | — | Child et al. (2019) Sparse Transformer; reduces the quadratic bottleneck |
The context window defines the maximum number of tokens a transformer can process in a single forward pass. Every token in the window can potentially attend to every other token via self-attention; tokens outside the window are inaccessible. This constraint determines what tasks the model can perform in a single pass.
Context Window Evolution
| Year | Architecture | Context Window | Positional Method |
|---|---|---|---|
| 2017 | Original Transformer | 512 tokens | Sinusoidal PE |
| 2018 | BERT | 512 tokens | Learned PE |
| 2019 | GPT-2 | 1,024 tokens | Learned PE |
| 2020 | GPT-3 | 2,048 tokens | Learned PE |
| 2021 | Longformer | 4,096–32K | Sliding window |
| 2022 | ALiBi-based models | 4K–16K+ | ALiBi bias |
| 2023+ | RoPE-based models | 4K–128K+ | RoPE |
| 2024+ | Extended architectures | 128K–1M | RoPE + YaRN/LongRoPE |
Memory Cost of Long Contexts
For full (dense) attention, memory costs at different sequence lengths:
| Sequence Length | Attention Matrix (fp16, 1 head) | KV Cache (32 layers, 32 heads, d=128, fp16) |
|---|---|---|
| 512 | 512² × 2B ≈ 0.5 MB | 2×32×32×128×512 × 2B ≈ 268 MB |
| 4,096 | 4096² × 2B ≈ 33.6 MB | 2×32×32×128×4096 × 2B ≈ 2.1 GB |
| 32,768 | 32768² × 2B ≈ 2.1 GB | 2×32×32×128×32768 × 2B ≈ 17.2 GB |
| 131,072 | 131072² × 2B ≈ 34.4 GB | 2×32×32×128×131072 × 2B ≈ 68.7 GB |
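These figures follow directly from the tensor shapes. A minimal Python sketch, using the same 32-layer, 32-head, d_head = 128, fp16 configuration as the table (the function names are illustrative):

```python
BYTES_FP16 = 2

def attn_matrix_bytes(n: int) -> int:
    """n x n attention weight matrix for a single head, fp16."""
    return n * n * BYTES_FP16

def kv_cache_bytes(n: int, n_layers: int = 32, n_heads: int = 32,
                   d_head: int = 128) -> int:
    """Keys and values (leading factor 2) for every layer, head, and token."""
    return 2 * n_layers * n_heads * d_head * n * BYTES_FP16

for n in (512, 4096, 32768, 131072):
    print(f"n={n:>6}: attn {attn_matrix_bytes(n)/2**20:8.1f} MiB/head, "
          f"KV cache {kv_cache_bytes(n)/2**30:6.2f} GiB")
```

Note the binary units (MiB/GiB) here versus the decimal MB/GB in the table; the byte counts are identical.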
Approaches to Extending Context
| Method | Memory Scaling | Performance Beyond Training Length |
|---|---|---|
| Dense attention (original) | O(n²) | Poor extrapolation |
| Sparse attention (Child et al.) | O(n·√n) | Moderate |
| Sliding window attention | O(n·w) — w=window size | Good for local patterns |
| RoPE (Su et al.) | O(n²) dense | Good extrapolation |
| ALiBi (Press et al.) | O(n²) dense | Strong extrapolation |
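Sliding-window attention's O(n·w) scaling comes from masking each query so it attends only to the previous w keys. A minimal NumPy sketch of such a causal window mask (the function name is illustrative, not from any library):

```python
import numpy as np

def sliding_window_mask(n: int, w: int) -> np.ndarray:
    """True where query i may attend to key j: causal, within window w."""
    i = np.arange(n)[:, None]  # query positions
    j = np.arange(n)[None, :]  # key positions
    return (j <= i) & (j > i - w)

mask = sliding_window_mask(8, 3)
# Each row has at most w True entries, so the stored attention scores
# scale as O(n*w) instead of O(n^2) for dense attention.
print(mask.sum(axis=1))  # per-query attended-key counts: [1 2 3 3 3 3 3 3]
```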
Related Pages
See kv-cache for how key-value pairs are stored during inference to avoid recomputation, and positional-encoding for the mathematical details of sinusoidal and rotary encoding methods.
Sources
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
- Su et al. (2024) — RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing 2024
- Press et al. (2022) — Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. ICLR 2022
- Child et al. (2019) — Generating Long Sequences with Sparse Transformers. arXiv
Frequently Asked Questions
Why does the context window matter for language models?
The context window determines how much prior text the model can 'see' when generating each token. Tasks like long-document summarization, code analysis across files, and multi-turn conversation require large context windows. A model processing its 5,000th token can only use information from within the context window — earlier tokens are effectively forgotten if they exceed the limit.
What limits context window size?
Two factors: memory and compute. The attention weight matrix is n×n, requiring O(n²) memory. At n = 32K tokens in fp16, the attention matrix alone is 32K × 32K × 2 bytes ≈ 2 GB per head. The KV cache (cached key-value pairs for all previous tokens) grows linearly with sequence length but can reach tens of GB for long sequences with many layers.
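A useful rule of thumb: the per-token KV-cache cost is fixed by the model shape, so total cache size grows linearly with context length. A quick sketch with the 32-layer, 32-head, d_head = 128 configuration used above:

```python
# Per-token KV cache: 2 (K and V) x layers x heads x d_head x 2 bytes (fp16).
bytes_per_token = 2 * 32 * 32 * 128 * 2
print(bytes_per_token // 1024, "KiB per token")  # 512 KiB per token

# Linear growth: every additional context token costs the same fixed amount.
for n in (4096, 131072):
    print(f"n={n}: {bytes_per_token * n / 2**30:.1f} GiB")
```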
How do modern positional encodings enable longer context?
Sinusoidal encodings (2017) handle lengths seen during training but degrade on longer inputs. RoPE (rotary position embedding) rotates query and key vectors so that attention scores depend only on relative distance, which generalizes to longer sequences, particularly when combined with extension methods such as YaRN and LongRoPE. ALiBi (Attention with Linear Biases) adds a distance-proportional linear penalty to attention scores; models trained on short contexts generalize to longer ones with minimal degradation. Both enable the 'train short, test long' paradigm.
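RoPE's relative-distance property can be demonstrated in a few lines of NumPy. This is an illustrative sketch of the rotation (pairing adjacent dimensions, base 10000 as in the RoFormer paper), not a production implementation:

```python
import numpy as np

def rope(x: np.ndarray, pos: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to x of shape (seq, d), d even.
    Each dim pair (2i, 2i+1) is rotated by angle pos * base**(-2i/d)."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) frequencies
    angles = pos[:, None] * inv_freq[None, :]      # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Key property: the score between a rotated query and key depends only on
# their relative offset, so the same vectors at positions (3, 7) and
# (103, 107) produce identical attention scores.
q = np.random.default_rng(0).standard_normal((1, 64))
s1 = rope(q, np.array([3])) @ rope(q, np.array([7])).T
s2 = rope(q, np.array([103])) @ rope(q, np.array([107])).T
print(np.allclose(s1, s2))  # True
```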