KV Cache: Key-Value Caching for Efficient Autoregressive Inference
KV caching stores the key-value pairs from previous tokens, reducing the per-token projection cost from O(t·d²) to O(d²); the cache for a 32-layer, 32-head, d_head=128 model at a 4K-token context is ~2.1 GB at fp16.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| KV cache memory formula | 2 × n_layers × n_heads × d_head × seq_len × dtype_bytes | bytes | Factor 2 for keys + values; scales linearly with layers, heads, and sequence length |
| Example: 32 layers, 32 heads, d_head=128, 4096 tokens, fp16 | 2.1 | GB | 2 × 32 × 32 × 128 × 4096 × 2 = 2,147,483,648 bytes ≈ 2.1 GB |
| Inference FLOPs without KV cache (token t) | O(t·d²) | FLOPs | Must recompute K, V for all t previous positions at each new token |
| Inference FLOPs with KV cache (token t) | O(d²) | FLOPs | Only compute K, V for the new token; attend over cached K, V for all prior tokens |
| Multi-Query Attention KV size reduction | 8–32× | reduction | MQA/GQA uses 1 or G < h KV heads shared across query heads; reduces KV cache proportionally |
The KV cache is the mechanism that makes autoregressive transformer inference computationally feasible. Without caching, generating a 1,000-token sequence would require 1,000 full forward passes over ever-growing prefixes. With caching, each new token requires only a single new key-value computation per layer, plus attention over the stored cache.
How Autoregressive Inference Works
Without KV cache — generating token t+1:
- Concatenate all tokens [x₁, x₂, …, x_t, x_{t+1}]
- Run full forward pass through all layers
- Read logits at position t+1
- Redundancy: K and V for positions 1..t were computed identically in the previous step
With KV cache — generating token t+1:
- Compute K and V only for new token x_{t+1}
- Append to cache: K_cache ← [K_cache; K_{t+1}], V_cache ← [V_cache; V_{t+1}]
- Compute attention using x_{t+1}'s query against the full K_cache and V_cache
- Total FLOPs per step: O(t·d) for attention + O(d²) for linear layers
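The cached decoding loop above can be sketched as follows. This is a minimal single-layer, single-head toy in numpy with hypothetical dimensions (d=16, random weights), not any specific model's implementation:

```python
import numpy as np

# Toy sketch of KV-cached decoding: one layer, one head, hypothetical d=16.
d = 16
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

K_cache = np.empty((0, d))  # grows by one row per generated token
V_cache = np.empty((0, d))

def decode_step(x):
    """Compute K, V only for the new token x, append to the cache, attend."""
    global K_cache, V_cache
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache = np.vstack([K_cache, k])       # K_cache <- [K_cache; K_{t+1}]
    V_cache = np.vstack([V_cache, v])       # V_cache <- [V_cache; V_{t+1}]
    scores = K_cache @ q / np.sqrt(d)       # O(t*d) attention over the cache
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache                # weighted sum of cached values

for _ in range(5):                          # generate 5 steps
    out = decode_step(rng.standard_normal(d))

print(K_cache.shape)  # (5, 16): one cached key per generated token
```

Per step, only the new token's projections (the O(d²) matmuls against Wq, Wk, Wv) are computed; the O(t·d) dot products against the cache are the only cost that grows with sequence length.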
KV Cache Memory Breakdown
For a model with n_layers=32, n_heads=32, d_head=128:
| Sequence Length | KV Cache Size (fp16) | KV Cache Size (int8) |
|---|---|---|
| 512 | 268 MB | 134 MB |
| 2,048 | 1.07 GB | 537 MB |
| 4,096 | 2.15 GB | 1.07 GB |
| 32,768 | 17.2 GB | 8.6 GB |
| 131,072 | 68.7 GB | 34.4 GB |
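The table values follow directly from the memory formula. A small sketch that reproduces the fp16 column (decimal GB, matching the table's units):

```python
def kv_cache_bytes(n_layers, n_heads, d_head, seq_len, dtype_bytes):
    # 2x for keys + values, stored per layer, per head, per position
    return 2 * n_layers * n_heads * d_head * seq_len * dtype_bytes

# fp16 column: 32 layers, 32 heads, d_head=128, 2 bytes per element
for seq_len in (512, 2048, 4096, 32768, 131072):
    gb = kv_cache_bytes(32, 32, 128, seq_len, 2) / 1e9
    print(f"{seq_len:>7} tokens: {gb:.2f} GB")
```

Halving dtype_bytes (fp16 → int8) halves every row, which is the int8 column.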
KV Cache Reduction Techniques
| Technique | KV Heads | Cache Reduction | Quality Impact |
|---|---|---|---|
| Multi-Head Attention (MHA) | h per layer | 1× (baseline) | Full quality |
| Multi-Query Attention (MQA) | 1 per layer | h× | Minor quality loss |
| Grouped Query Attention (GQA) | G per layer (G<h) | h/G × | Near-MHA quality |
| PagedAttention | — | No size change | Reduces fragmentation |
| KV cache quantization | — | 2–4× | <1% quality loss |
GQA (Ainslie et al., 2023) with G=8 groups reduces cache size 4× (for h=32) while maintaining nearly full MHA quality, making it the dominant approach in efficient inference.
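The reductions in the table fall out of one observation: the cache scales with the number of KV heads, not query heads. A quick check using the example model's dimensions (h=32 query heads, hypothetical G=8 for GQA):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, dtype_bytes):
    # Cache size depends only on KV heads; query heads are irrelevant
    return 2 * n_layers * n_kv_heads * d_head * seq_len * dtype_bytes

h = 32                                          # query heads
mha = kv_cache_bytes(32, h, 128, 4096, 2)       # MHA: h KV heads (baseline)
gqa = kv_cache_bytes(32, 8, 128, 4096, 2)       # GQA: G=8 shared KV heads
mqa = kv_cache_bytes(32, 1, 128, 4096, 2)       # MQA: 1 shared KV head
print(mha // gqa, mha // mqa)  # -> 4 32
```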
Related Pages
See context-window for how sequence length interacts with memory, quantization for KV cache precision reduction, and inference-vs-training-compute for how KV caching affects overall inference compute budgets.
Sources
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
- Pope et al. (2023) — Efficiently Scaling Transformer Inference. MLSys 2023
- Ainslie et al. (2023) — GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP 2023
Frequently Asked Questions
Why is a KV cache necessary for autoregressive inference?
During autoregressive generation, the model generates one token at a time. Without caching, to generate token t it would need to recompute the attention keys and values for all t−1 previous tokens in every layer — O(t·d²) projection FLOPs per new token. With the KV cache, keys and values are computed once and stored; each new token adds only O(1) new KV pairs per layer, so the expensive O(d²) projections become constant per token, leaving only the O(t·d) attention over the cache to grow with sequence length.
How large does the KV cache get in practice?
KV cache size = 2 × n_layers × n_heads × d_head × seq_len × bytes_per_element. For a medium-scale model (32 layers, 32 heads, d_head=128) running at fp16 with a 4K token context, the KV cache is ~2.1 GB. At 128K tokens, the same model requires ~69 GB of KV cache alone — often exceeding the memory needed for the model weights themselves. This is why long-context inference requires careful memory management.
What is Multi-Query Attention and how does it reduce KV cache?
Standard multi-head attention (MHA) maintains separate K and V projections for each of the h heads. Multi-Query Attention (MQA, Shazeer 2019) uses a single K and V projection shared across all query heads, reducing KV cache size by a factor of h. Grouped Query Attention (GQA, Ainslie et al. 2023) is a middle ground: G shared KV heads (1 < G < h), each serving a group of h/G query heads, reducing cache size by h/G while retaining most of MHA's quality.
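The head-sharing mechanism can be sketched in numpy. This is a toy with hypothetical shapes (h=8 query heads, G=2 KV heads, t=10 cached positions), not a real model; each cached KV head is broadcast to its group of h/G query heads at attention time:

```python
import numpy as np

# GQA toy: h=8 query heads share G=2 KV heads; each KV head serves h//G = 4.
h, G, d_head, t = 8, 2, 16, 10
rng = np.random.default_rng(1)
q = rng.standard_normal((h, d_head))        # current token's query, per head
K = rng.standard_normal((G, t, d_head))     # cached keys: only G copies stored
V = rng.standard_normal((G, t, d_head))     # cached values: only G copies stored

K_full = np.repeat(K, h // G, axis=0)       # expand each KV head to its group
V_full = np.repeat(V, h // G, axis=0)       # shapes become (h, t, d_head)

scores = np.einsum('hd,htd->ht', q, K_full) / np.sqrt(d_head)
w = np.exp(scores - scores.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)           # softmax over cached positions
out = np.einsum('ht,htd->hd', w, V_full)    # (h, d_head) attended output
print(out.shape)  # (8, 16)
```

Only the (G, t, d_head) tensors live in the cache; the repeat happens on the fly, which is where the h/G memory saving comes from.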