Context Window: Sequence Length Limits, Positional Methods, and Long-Context Extensions
Original transformer context window: 512 tokens; self-attention memory scales as O(n²), making long contexts expensive; RoPE enables extrapolation beyond training length; some architectures reach 1M-token context windows.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Original transformer (2017) | 512 | tokens | Limited by O(n²) attention memory at training time |
| Attention memory scaling | O(n²) | — | 4× longer sequence = 16× more attention memory; dominant cost at long context |
| KV cache size (n_layers=32, n_heads=32, d_head=128, n=4096, fp16) | ≈ 2.1 | GB | 2 × 32 × 32 × 128 × 4096 × 2 bytes ≈ 2.1 GB; KV cache dominates memory at long contexts and grows linearly with sequence length |
| RoPE extrapolation | ~8× | training length | Su et al. (2024): RoPE reported to maintain performance well beyond training context length |
| Sparse attention memory reduction | O(n·√n) | — | Child et al. (2019) Sparse Transformer; reduces the quadratic bottleneck |
The context window defines the maximum number of tokens a transformer can process in a single forward pass. Every token in the window can potentially attend to every other token via self-attention; tokens outside the window are inaccessible. This constraint determines what tasks the model can perform in a single pass.
Context Window Evolution
| Year | Architecture | Context Window | Positional Method |
|---|---|---|---|
| 2017 | Original Transformer | 512 tokens | Sinusoidal PE |
| 2018 | BERT | 512 tokens | Learned PE |
| 2019 | GPT-2 | 1,024 tokens | Learned PE |
| 2020 | GPT-3 | 2,048 tokens | Learned PE |
| 2021 | Longformer | 4,096–32K | Sliding window |
| 2022 | ALiBi-based models | 4K–16K+ | ALiBi bias |
| 2023+ | RoPE-based models | 4K–128K+ | RoPE |
| 2024+ | Extended architectures | 128K–1M | RoPE + YaRN/LongRoPE |
Memory Cost of Long Contexts
For full (dense) attention, memory costs at different sequence lengths:
| Sequence Length | Attention Matrix (fp16, 1 head) | KV Cache (32 layers, 32 heads, d=128, fp16) |
|---|---|---|
| 512 | 512² × 2B ≈ 0.5 MB | 2×32×32×128×512 × 2B ≈ 268 MB |
| 4,096 | 4096² × 2B ≈ 33.6 MB | 2×32×32×128×4096 × 2B ≈ 2.1 GB |
| 32,768 | 32768² × 2B ≈ 2.1 GB | 2×32×32×128×32768 × 2B ≈ 17.2 GB |
| 131,072 | 131072² × 2B ≈ 34.4 GB | 2×32×32×128×131072 × 2B ≈ 68.7 GB |
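These figures follow directly from the tensor shapes. A minimal Python sketch, using the same 32-layer, 32-head, d_head = 128, fp16 configuration as the table (the function names are illustrative):

```python
BYTES_FP16 = 2

def attn_matrix_bytes(n: int) -> int:
    """n x n attention weight matrix for a single head, fp16."""
    return n * n * BYTES_FP16

def kv_cache_bytes(n: int, n_layers: int = 32, n_heads: int = 32,
                   d_head: int = 128) -> int:
    """Keys and values (leading factor 2) for every layer, head, and token."""
    return 2 * n_layers * n_heads * d_head * n * BYTES_FP16

for n in (512, 4096, 32768, 131072):
    print(f"n={n:>6}: attn {attn_matrix_bytes(n)/2**20:8.1f} MiB/head, "
          f"KV cache {kv_cache_bytes(n)/2**30:6.2f} GiB")
```

Note the binary units (MiB/GiB) here versus the decimal MB/GB in the table; the byte counts are identical.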
Approaches to Extending Context
| Method | Memory Scaling | Performance Beyond Training Length |
|---|---|---|
| Dense attention (original) | O(n²) | Poor extrapolation |
| Sparse attention (Child et al.) | O(n·√n) | Moderate |
| Sliding window attention | O(n·w) — w=window size | Good for local patterns |
| RoPE (Su et al.) | O(n²) dense | Good extrapolation |
| ALiBi (Press et al.) | O(n²) dense | Strong extrapolation |
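Sliding-window attention's O(n·w) scaling comes from masking each query so it attends only to the previous w keys. A minimal NumPy sketch of such a causal window mask (the function name is illustrative, not from any library):

```python
import numpy as np

def sliding_window_mask(n: int, w: int) -> np.ndarray:
    """True where query i may attend to key j: causal, within window w."""
    i = np.arange(n)[:, None]  # query positions
    j = np.arange(n)[None, :]  # key positions
    return (j <= i) & (j > i - w)

mask = sliding_window_mask(8, 3)
# Each row has at most w True entries, so the stored attention scores
# scale as O(n*w) instead of O(n^2) for dense attention.
print(mask.sum(axis=1))  # per-query attended-key counts: [1 2 3 3 3 3 3 3]
```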
Related Pages
See kv-cache for how key-value pairs are stored during inference to avoid recomputation, and positional-encoding for the mathematical details of sinusoidal and rotary encoding methods.
Sources
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
- Su et al. (2024) — RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing 2024
- Press et al. (2022) — Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. ICLR 2022
- Child et al. (2019) — Generating Long Sequences with Sparse Transformers. arXiv
Frequently Asked Questions
Why does the context window matter for language models?
The context window determines how much prior text the model can 'see' when generating each token. Tasks like long-document summarization, code analysis across files, and multi-turn conversation require large context windows. A model processing its 5,000th token can only use information from within the context window — earlier tokens are effectively forgotten if they exceed the limit.
What limits context window size?
Two factors: memory and compute. The attention weight matrix is n×n, requiring O(n²) memory. At n = 32K tokens in fp16, the attention matrix alone is 32K × 32K × 2 bytes ≈ 2 GB per head. The KV cache (cached key-value pairs for all previous tokens) grows linearly with sequence length but can reach tens of GB for long sequences with many layers.
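A useful rule of thumb: the per-token KV-cache cost is fixed by the model shape, so total cache size grows linearly with context length. A quick sketch with the 32-layer, 32-head, d_head = 128 configuration used above:

```python
# Per-token KV cache: 2 (K and V) x layers x heads x d_head x 2 bytes (fp16).
bytes_per_token = 2 * 32 * 32 * 128 * 2
print(bytes_per_token // 1024, "KiB per token")  # 512 KiB per token

# Linear growth: every additional context token costs the same fixed amount.
for n in (4096, 131072):
    print(f"n={n}: {bytes_per_token * n / 2**30:.1f} GiB")
```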
How do modern positional encodings enable longer context?
Sinusoidal encodings (2017) handle lengths seen during training but degrade on longer inputs. RoPE (rotary position embedding) rotates query and key vectors so that attention scores depend only on relative distance, which generalizes to longer sequences, particularly when combined with extension methods such as YaRN and LongRoPE. ALiBi (Attention with Linear Biases) adds a distance-proportional linear penalty to attention scores; models trained on short contexts generalize to longer ones with minimal degradation. Both enable the 'train short, test long' paradigm.
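RoPE's relative-distance property can be demonstrated in a few lines of NumPy. This is an illustrative sketch of the rotation (pairing adjacent dimensions, base 10000 as in the RoFormer paper), not a production implementation:

```python
import numpy as np

def rope(x: np.ndarray, pos: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to x of shape (seq, d), d even.
    Each dim pair (2i, 2i+1) is rotated by angle pos * base**(-2i/d)."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) frequencies
    angles = pos[:, None] * inv_freq[None, :]      # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Key property: the score between a rotated query and key depends only on
# their relative offset, so the same vectors at positions (3, 7) and
# (103, 107) produce identical attention scores.
q = np.random.default_rng(0).standard_normal((1, 64))
s1 = rope(q, np.array([3])) @ rope(q, np.array([7])).T
s2 = rope(q, np.array([103])) @ rope(q, np.array([107])).T
print(np.allclose(s1, s2))  # True
```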