Context Window: Sequence Length Limits, Positional Methods, and Long-Context Extensions

Category: representation · Updated: 2026-02-27

Original transformer context window: 512 tokens. Self-attention memory scales as O(n²), making long contexts expensive. RoPE supports extension beyond the training length, especially when combined with interpolation schemes such as YaRN or LongRoPE; some architectures reach 1M-token context windows.

Key Data Points

| Measure | Value | Unit | Notes |
|---|---|---|---|
| Original transformer (2017) | 512 | tokens | Limited by O(n²) attention memory at training time |
| Attention memory scaling | O(n²) | — | 4× longer sequence = 16× more attention memory; dominant cost at long context |
| KV cache size (n_layers=32, n_heads=32, d_head=128, n=4096, fp16) | 2 × 32 × 32 × 128 × 4096 × 2 bytes ≈ 2.1 GB | bytes | KV cache dominates memory at long contexts; grows linearly with sequence length |
| RoPE extrapolation | Tested to 8× training length | relative | Su et al. (2024): RoPE maintains performance well beyond training context length |
| Sparse attention memory reduction | O(n·√n) | — | Child et al. (2019) Sparse Transformer; reduces the quadratic bottleneck |
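The KV-cache figure above is plain arithmetic and can be checked directly. A minimal sketch (the function name and parameter names are illustrative, not from any particular library):

```python
def kv_cache_bytes(n_layers, n_heads, d_head, seq_len, bytes_per_param=2):
    """Bytes needed to cache keys and values for one sequence.
    The leading factor of 2 covers the separate K and V tensors;
    bytes_per_param=2 corresponds to fp16."""
    return 2 * n_layers * n_heads * d_head * seq_len * bytes_per_param

size = kv_cache_bytes(n_layers=32, n_heads=32, d_head=128, seq_len=4096)
print(f"{size:,} bytes ≈ {size / 10**9:.1f} GB")  # 2,147,483,648 bytes ≈ 2.1 GB
```

Note the linear dependence on `seq_len`: doubling the context exactly doubles the cache, unlike the quadratic attention matrix.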

The context window defines the maximum number of tokens a transformer can process in a single forward pass. Every token in the window can potentially attend to every other token via self-attention; tokens outside the window are inaccessible. This constraint determines what tasks the model can perform in a single pass.

Context Window Evolution

| Year | Architecture | Context Window | Positional Method |
|---|---|---|---|
| 2017 | Original Transformer | 512 tokens | Sinusoidal PE |
| 2018 | BERT | 512 tokens | Learned PE |
| 2019 | GPT-2 | 1,024 tokens | Learned PE |
| 2020 | GPT-3 | 2,048 tokens | Learned PE |
| 2021 | Longformer | 4,096–32K | Sliding window |
| 2022 | ALiBi-based models | 4K–16K+ | ALiBi bias |
| 2023+ | RoPE-based models | 4K–128K+ | RoPE |
| 2024+ | Extended architectures | 128K–1M | RoPE + YaRN/LongRoPE |

Memory Cost of Long Contexts

For full (dense) attention, memory costs at different sequence lengths:

| Sequence Length | Attention Matrix (fp16, 1 head) | KV Cache (32 layers, 32 heads, d_head=128, fp16) |
|---|---|---|
| 512 | 512² × 2 B = 0.5 MB | 2×32×32×128×512×2 B ≈ 268 MB |
| 4,096 | 4,096² × 2 B = 33.6 MB | 2×32×32×128×4,096×2 B ≈ 2.1 GB |
| 32,768 | 32,768² × 2 B = 2.1 GB | 2×32×32×128×32,768×2 B ≈ 17.2 GB |
| 131,072 | 131,072² × 2 B = 34.4 GB | 2×32×32×128×131,072×2 B ≈ 68.7 GB |

(The KV-cache values follow from the formulas shown and are consistent with the 2.1 GB figure at n=4,096 in the Key Data Points table.)
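The quadratic column of the table can be reproduced with a one-line formula; a sketch (names are illustrative):

```python
def attn_matrix_bytes(n, bytes_per_el=2):
    """Size of one head's n x n attention weight matrix (fp16 = 2 bytes)."""
    return n * n * bytes_per_el

for n in (512, 4_096, 32_768, 131_072):
    print(f"n={n:>7,}: {attn_matrix_bytes(n) / 10**6:>10,.1f} MB")

# Quadratic scaling: 8x the tokens costs 64x the memory.
assert attn_matrix_bytes(4_096) == 64 * attn_matrix_bytes(512)
```

This is why dense attention, not the linearly growing KV cache, is the first bottleneck as contexts lengthen.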

Approaches to Extending Context

| Method | Memory Scaling | Performance Beyond Training Length |
|---|---|---|
| Dense attention (original) | O(n²) | Poor extrapolation |
| Sparse attention (Child et al.) | O(n·√n) | Moderate |
| Sliding-window attention | O(n·w), w = window size | Good for local patterns |
| RoPE (Su et al.) | O(n²) dense | Good extrapolation (best with interpolation such as YaRN) |
| ALiBi (Press et al.) | O(n²) dense | Strong extrapolation |
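To make the ALiBi row concrete: the bias is just a per-head slope times the query–key distance, subtracted from pre-softmax attention scores. A minimal sketch (the geometric slope schedule follows Press et al.; the helper names and the tiny sequence length are illustrative):

```python
def alibi_slopes(n_heads):
    # Press et al. use a geometric sequence of head slopes, 2^(-8i/n_heads).
    return [2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)]

def alibi_bias(slope, seq_len):
    """Causal bias matrix: -slope * (query_pos - key_pos), -inf above the diagonal."""
    return [[-slope * (q - k) if k <= q else float("-inf")
             for k in range(seq_len)]
            for q in range(seq_len)]

bias = alibi_bias(alibi_slopes(8)[0], seq_len=4)
# Last row: penalty grows with distance from the query, zero at the query itself.
```

Because the bias is a fixed function of distance rather than a learned table, the same formula applies at any sequence length, which is why ALiBi extrapolates so well past its training context.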

See kv-cache for how key-value pairs are stored during inference to avoid recomputation, and positional-encoding for the mathematical details of sinusoidal and rotary encoding methods.


Frequently Asked Questions

Why does the context window matter for language models?

The context window determines how much prior text the model can 'see' when generating each token. Tasks like long-document summarization, code analysis across files, and multi-turn conversation require large context windows. A model processing its 5,000th token can only use information from within the context window — earlier tokens are effectively forgotten if they exceed the limit.

What limits context window size?

Two factors: memory and compute. The attention weight matrix is n×n, requiring O(n²) memory. At n=32,768 tokens in fp16, the attention matrix alone is 32,768 × 32,768 × 2 bytes ≈ 2.1 GB per head. The KV cache (the keys and values saved for all previous tokens) grows only linearly with sequence length, but can still reach tens of GB for long sequences with many layers.

How do modern positional encodings enable longer context?

Sinusoidal encodings (2017) work up to the lengths seen during training but degrade on longer inputs. RoPE (rotary position embedding) applies rotation matrices whose angles depend on position, so attention scores depend only on relative offsets and generalize more gracefully to longer sequences. ALiBi (Attention with Linear Biases) subtracts a linear, distance-proportional bias from attention scores; models trained on short contexts generalize to longer ones with minimal degradation. Both enable the 'train short, test long' paradigm.
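The key property of RoPE can be shown in a few lines: each pair of feature dimensions is rotated by an angle proportional to its absolute position, and the dot product between a rotated query and key then depends only on their relative offset. A minimal 2-D sketch (one frequency band; θ and the example vectors are illustrative):

```python
import math

def rotate(vec, pos, theta=0.01):
    """Rotate a 2-D feature pair by pos * theta (one RoPE frequency band)."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 0.5), (0.3, 0.8)
# Shifting both positions by the same amount leaves the score unchanged:
s1 = dot(rotate(q, pos=5), rotate(k, pos=2))      # relative offset 3
s2 = dot(rotate(q, pos=105), rotate(k, pos=102))  # same offset, shifted by 100
assert abs(s1 - s2) < 1e-9
```

Full RoPE applies this rotation independently to many dimension pairs with different θ values; context-extension schemes like YaRN rescale those angles so positions beyond the training range map back into the angular range the model has seen.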
