KV Cache & Efficient Attention
When a transformer generates text autoregressively (one token at a time), a naive implementation recomputes the Keys and Values for every previous token at each step. The KV cache stores the projected K and V matrices from earlier steps, so each step computes K and V only for the current token. Removing that redundant work is what makes long-sequence inference computationally feasible.
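The difference between the two decoding strategies can be sketched with a toy single-head attention layer. This is a minimal illustration in plain Python, not a production implementation: the helper names (`matvec`, `attend`, `decode_naive`, `decode_cached`), the 4-dimensional vectors, and the random weights are all assumptions made for the example.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def decode_naive(tokens, Wq, Wk, Wv):
    """Re-project K and V for the whole prefix at every step."""
    outputs = []
    for t in range(len(tokens)):
        keys = [matvec(Wk, x) for x in tokens[: t + 1]]
        values = [matvec(Wv, x) for x in tokens[: t + 1]]
        outputs.append(attend(matvec(Wq, tokens[t]), keys, values))
    return outputs

def decode_cached(tokens, Wq, Wk, Wv):
    """Project only the newest token and append its K and V to the cache."""
    k_cache, v_cache, outputs = [], [], []
    for x in tokens:
        k_cache.append(matvec(Wk, x))
        v_cache.append(matvec(Wv, x))
        outputs.append(attend(matvec(Wq, x), k_cache, v_cache))
    return outputs

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d = 4  # toy embedding size (illustrative)
Wq, Wk, Wv = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(6)]

naive = decode_naive(tokens, Wq, Wk, Wv)
cached = decode_cached(tokens, Wq, Wk, Wv)
```

Both functions produce the same attention outputs. The naive loop performs a number of K/V projections that grows quadratically with sequence length, while the cached loop performs one K and one V projection per token.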
The KV Cache Problem
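KV cache size grows linearly with sequence length and batch size:
$$\text{cache bytes} = 2 \times n_{\text{layers}} \times n_{\text{kv heads}} \times d_{\text{head}} \times \text{seq len} \times \text{batch size} \times \text{bytes per element}$$
The factor of 2 accounts for storing both K and V.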
For a 70B model with 200K context, the cache can exceed 100 GB and, depending on the number of KV heads and the storage precision, outgrow the model weights themselves [13]. This makes the KV cache the primary memory bottleneck in long-context inference.
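To make the scale concrete, the cache size can be estimated as 2 (K and V) × layers × KV heads × head dimension × sequence length × batch size × bytes per element. The sketch below applies that product to a hypothetical 70B-class shape (80 layers, 64 query heads, head dimension 128, FP16 storage). These dimensions are assumptions chosen for illustration, not the configuration of any specific model cited here.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes.

    The leading 2 counts one K tensor and one V tensor per layer.
    bytes_per_elem=2 corresponds to FP16/BF16 storage.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Hypothetical 70B-class shape at 200K tokens of context (assumed values).
full_mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=200_000)
gqa_8 = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=200_000)

print(f"MHA, 64 KV heads: {full_mha / 1e9:.1f} GB")  # 524.3 GB
print(f"GQA,  8 KV heads: {gqa_8 / 1e9:.1f} GB")     # 65.5 GB
```

Under these assumptions a single 200K-token sequence already needs tens to hundreds of gigabytes, and the total scales further with batch size.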
Efficient Attention Variants
Multi-Head Attention (MHA) — Original
Each attention head has its own Q, K, and V projections, so the cache stores K and V for every head. This is the full-size baseline.
Multi-Query Attention (MQA)
All heads share a single K and V head. The KV cache shrinks by a factor equal to the number of query heads, with minimal quality loss. Used by Falcon and PaLM [15].
Grouped-Query Attention (GQA) — 2026 Standard
A compromise: query heads are divided into groups, and each group shares one set of K/V projections. For example, 8 query heads with 2 KV groups gives a 4× cache reduction relative to MHA.
Why GQA won: Better quality than MQA, simpler than MLA, good hardware utilization. Used by Llama 3, Mistral, Gemma 3 [3, 13, 15].
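The grouping can be expressed as a mapping from query heads to KV heads, with MHA and MQA as the two extremes. The following is a small sketch under the assumption that query heads are assigned to groups in contiguous blocks. The function names `kv_head_index` and `cache_reduction` are hypothetical.

```python
def kv_head_index(query_head, n_query_heads, n_kv_heads):
    """Return the index of the KV head that a given query head reads from."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be divisible by n_kv_heads")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

def cache_reduction(n_query_heads, n_kv_heads):
    """KV cache shrink factor relative to MHA with the same query heads."""
    return n_query_heads / n_kv_heads

# GQA: 8 query heads, 2 KV groups
gqa_map = [kv_head_index(h, 8, 2) for h in range(8)]  # [0, 0, 0, 0, 1, 1, 1, 1]
gqa_factor = cache_reduction(8, 2)                    # 4.0

# MHA is the special case n_kv_heads == n_query_heads (no sharing).
mha_map = [kv_head_index(h, 8, 8) for h in range(8)]  # [0, 1, 2, 3, 4, 5, 6, 7]

# MQA is the special case n_kv_heads == 1 (every query head shares one KV head).
mqa_map = [kv_head_index(h, 8, 1) for h in range(8)]  # [0, 0, 0, 0, 0, 0, 0, 0]
mqa_factor = cache_reduction(8, 1)                    # 8.0
```

Treating the number of KV heads as a single setting is what lets one attention implementation cover all three variants.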
Multi-head Latent Attention (MLA)
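DeepSeek V2’s innovation: jointly compress K and V into a low-dimensional latent vector and cache that vector instead of the full per-head K and V. DeepSeek reports roughly 93% KV-cache reduction for V2 relative to its earlier DeepSeek 67B model. The attention kernel is more complex, but the smaller cache makes much longer context windows practical [13].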
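The core idea of caching a compressed latent can be sketched as a down-projection at write time and up-projections at read time. This is a simplified illustration only: the class name `LatentKVCache`, the toy dimensions, and the random weights are assumptions, and the separate rotary-position key used in the published design is omitted.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

class LatentKVCache:
    """Cache one small latent vector per token instead of full K and V."""

    def __init__(self, d_model, d_latent, n_heads, head_dim, seed=0):
        rng = random.Random(seed)
        self.W_down = rand_matrix(d_latent, d_model, rng)            # compress
        self.W_up_k = rand_matrix(n_heads * head_dim, d_latent, rng)  # expand to K
        self.W_up_v = rand_matrix(n_heads * head_dim, d_latent, rng)  # expand to V
        self.latents = []

    def append(self, hidden):
        """Store only the compressed latent for the new token."""
        self.latents.append(matvec(self.W_down, hidden))

    def keys_values(self):
        """Rebuild per-head K and V (concatenated) from the cached latents."""
        keys = [matvec(self.W_up_k, c) for c in self.latents]
        values = [matvec(self.W_up_v, c) for c in self.latents]
        return keys, values

    def cached_elems_per_token(self):
        return len(self.W_down)  # equals d_latent

cache = LatentKVCache(d_model=16, d_latent=4, n_heads=4, head_dim=4)
for _ in range(3):
    cache.append([0.1] * 16)
keys, values = cache.keys_values()

mha_elems = 2 * 4 * 4  # K and V for 4 heads of dimension 4
reduction = 1 - cache.cached_elems_per_token() / mha_elems  # 0.875
```

In this toy configuration the cache holds 4 numbers per token instead of 32. The trade-off is the extra up-projection work at attention time, which is the source of the added kernel complexity.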
Other Optimization Techniques
- KV cache quantization: Store the cache in FP8 or INT4 (2-4× compression relative to FP16)
- KV cache eviction: Drop cache entries that contribute little to attention (H2O, StreamingLLM)
- Ring attention / distributed cache: Shard cache across multiple GPUs
- Sparse attention patterns: Only compute attention for a subset of token pairs
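Two of the techniques above are simple enough to sketch directly. The first function shows symmetric per-vector INT4 quantization of cache entries. The second shows an eviction rule in the spirit of StreamingLLM that keeps a few initial tokens plus a window of the most recent ones. Both are minimal illustrations with hypothetical names and default parameters, not the implementations used by any particular library.

```python
def quantize_int4(vec):
    """Symmetric per-vector INT4 quantization. Returns (codes, scale)."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Signed 4-bit integers cover the range [-8, 7].
    codes = [max(-8, min(7, round(x / scale))) for x in vec]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]

def sink_plus_window(seq_len, n_sink=4, window=8):
    """Indices of cache entries to keep: the first n_sink tokens
    plus the most recent `window` tokens."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

key_vector = [0.73, -1.4, 0.05, 0.0, 1.1]
codes, scale = quantize_int4(key_vector)
restored = dequantize(codes, scale)

kept = sink_plus_window(seq_len=20, n_sink=2, window=3)  # [0, 1, 17, 18, 19]
```

The quantization error per element is bounded by half the scale, so vectors with a few large outliers lose more precision. Practical schemes often quantize in smaller groups for that reason.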
References
- Tan — Crystallization of Transformer Architectures (GQA as consensus choice)
- [13] Multi-Segment Attention (arXiv 2606.02964, 2026)
- [15] Grouped Query Attention — The Sweet Spot Between Quality and Efficiency (2026)