KV Cache & Efficient Attention

When a transformer generates text autoregressively (one token at a time), a naive implementation recomputes the Keys and Values of every previous token at each step. The KV cache stores the projected K and V tensors from earlier steps, so each step computes K and V only for the current token. Removing that redundant work is what makes long autoregressive inference practical.
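To make the saving concrete, here is a minimal single-head sketch in pure Python. It assumes toy dimensions and random weights, and the helper names (`decode_naive`, `decode_cached`) are illustrative only, not taken from any library. It checks that cached decoding matches naive recomputation.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def decode_naive(tokens, Wq, Wk, Wv):
    """Re-project K and V for the whole prefix at every step."""
    outputs = []
    for t in range(len(tokens)):
        keys = [matvec(Wk, x) for x in tokens[: t + 1]]
        values = [matvec(Wv, x) for x in tokens[: t + 1]]
        outputs.append(attend(matvec(Wq, tokens[t]), keys, values))
    return outputs

def decode_cached(tokens, Wq, Wk, Wv):
    """Project K and V once per token and append them to a cache."""
    k_cache, v_cache, outputs = [], [], []
    for x in tokens:
        k_cache.append(matvec(Wk, x))  # only the current token is projected
        v_cache.append(matvec(Wv, x))
        outputs.append(attend(matvec(Wq, x), k_cache, v_cache))
    return outputs

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model = 4  # toy size, chosen only for illustration
Wq, Wk, Wv = (random_matrix(rng, d_model, d_model) for _ in range(3))
tokens = random_matrix(rng, 6, d_model)  # six stand-in token embeddings

naive = decode_naive(tokens, Wq, Wk, Wv)
cached = decode_cached(tokens, Wq, Wk, Wv)
max_diff = max(abs(a - b) for n, c in zip(naive, cached) for a, b in zip(n, c))
```

For T tokens the naive loop performs T(T+1)/2 key projections, while the cached loop performs T. The cost is that the cache has to stay in memory.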

The KV Cache Problem

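KV cache size grows linearly with sequence length and batch size:

KV cache bytes = 2 × n_layers × n_kv_heads × d_head × seq_len × batch_size × bytes_per_element

The factor of 2 accounts for storing both K and V.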

For a 70B model with 200K context, the cache can exceed 100 GB — larger than the model weights themselves [13]. This is the primary memory bottleneck in long-context inference.
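A rough back-of-the-envelope check can be sketched as follows. It uses the usual sizing rule (one K and one V vector per layer, per KV head, per token). The layer count, head count, and head dimension are assumed values for a hypothetical 70B-class model, not figures from the cited source.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache K and V (the factor 2) for every layer and token."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Assumed shape for a hypothetical 70B-class model (illustrative only).
n_layers, n_heads, head_dim = 80, 64, 128
seq_len = 200_000  # 200K-token context, FP16 cache (2 bytes per element)

mha_bytes = kv_cache_bytes(n_layers, n_heads, head_dim, seq_len)  # one KV head per query head
gqa_bytes = kv_cache_bytes(n_layers, 8, head_dim, seq_len)        # 8 shared KV heads

print(f"MHA cache: {mha_bytes / 1e9:.1f} GB")  # MHA cache: 524.3 GB
print(f"GQA cache: {gqa_bytes / 1e9:.1f} GB")  # GQA cache: 65.5 GB
```

Under these assumptions the cache for a single sequence ranges from tens to hundreds of gigabytes, depending on how many KV heads are stored.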

Efficient Attention Variants

Multi-Head Attention (MHA) — Original

Each attention head has its own Q, K, and V projections, so the cache stores K and V for every head. This full-size cache is the baseline for the variants below.

Multi-Query Attention (MQA)

All query heads share a single K and V head. The KV cache shrinks by a factor of n_heads (the number of query heads) with minimal quality loss. Used by Falcon and PaLM [15].

Grouped-Query Attention (GQA) — 2026 Standard

A compromise: query heads are divided into groups, and each group shares one K/V head. E.g., 8 query heads in 2 KV groups → 4× cache reduction versus MHA.
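The head sharing can be sketched as a simple index mapping, with MHA and MQA as the two extreme cases of GQA. The function names here are hypothetical and chosen only for illustration.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the shared K/V head that a given query head reads from."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def cache_reduction(n_q_heads, n_kv_heads):
    """KV cache shrink factor relative to MHA with the same query heads."""
    return n_q_heads // n_kv_heads

n_q = 8
gqa_map = [kv_head_for_query_head(h, n_q, 2) for h in range(n_q)]
mha_map = [kv_head_for_query_head(h, n_q, n_q) for h in range(n_q)]
mqa_map = [kv_head_for_query_head(h, n_q, 1) for h in range(n_q)]

print(gqa_map)                  # [0, 0, 0, 0, 1, 1, 1, 1]
print(cache_reduction(n_q, 2))  # 4
```

Setting `n_kv_heads` equal to the number of query heads recovers MHA (no reduction), and setting it to 1 recovers MQA (reduction by the full head count).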

Why GQA won: Better quality than MQA, simpler than MLA, good hardware utilization. Used by Llama 3, Mistral, Gemma 3 [3, 13, 15].

Multi-head Latent Attention (MLA)

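DeepSeek-V2’s innovation: jointly compress K and V into a low-dimensional latent vector and cache that vector instead of the full per-head K and V. DeepSeek-V2 reports a ~93% KV-cache reduction relative to its predecessor, DeepSeek 67B. The kernel is more complex, but the smaller cache enables much longer context windows [13].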
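The core idea of latent caching can be sketched as below. This is a deliberately simplified illustration, not DeepSeek's implementation: it omits the separate positional (RoPE) key component and the trick of folding up-projections into other weights. All dimensions and matrix names (`W_down`, `W_up_k`, `W_up_v`) are assumptions chosen for the example.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
# Toy sizes, chosen arbitrarily for illustration.
d_model, n_heads, head_dim, d_latent = 16, 4, 4, 4

W_down = random_matrix(rng, d_latent, d_model)            # hidden state -> latent
W_up_k = random_matrix(rng, n_heads * head_dim, d_latent)  # latent -> all heads' K
W_up_v = random_matrix(rng, n_heads * head_dim, d_latent)  # latent -> all heads' V

latent_cache = []  # the only per-token state kept between steps

def step(hidden):
    """Cache the compressed latent, then rebuild K and V for the whole prefix."""
    latent_cache.append(matvec(W_down, hidden))
    keys = [matvec(W_up_k, c) for c in latent_cache]
    values = [matvec(W_up_v, c) for c in latent_cache]
    return keys, values

for _ in range(5):
    keys, values = step([rng.uniform(-1, 1) for _ in range(d_model)])

floats_per_token_mha = 2 * n_heads * head_dim  # full K and V for every head
floats_per_token_latent = d_latent             # one shared latent vector
compression = 1 - floats_per_token_latent / floats_per_token_mha
```

With these toy sizes the cache holds 4 numbers per token per layer instead of 32. The ratio depends entirely on the chosen latent width, so it should not be read as reproducing any published figure.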

Other Optimization Techniques

  • KV cache quantization: Store the cache in FP8 or INT4 instead of FP16 (2-4× compression)
  • KV cache eviction: Drop cache entries judged unimportant (e.g., H2O, StreamingLLM)
  • Ring attention / distributed cache: Shard cache across multiple GPUs
  • Sparse attention patterns: Only compute attention for a subset of token pairs
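Two of these ideas can be sketched in a few lines. The first function follows the StreamingLLM idea of keeping a few initial "attention sink" tokens plus a recent window. The second is a generic symmetric integer quantizer, used here as a stand-in for INT4/INT8 cache storage (it does not model FP8). The function names and default sizes are assumptions for illustration, and real systems add details such as position re-indexing and per-channel scales.

```python
def sink_plus_window_keep(seq_len, n_sink=4, window=8):
    """Positions kept by a sink-plus-recent-window eviction policy."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))  # nothing to evict yet
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

def quantize(vec, n_bits=8):
    """Symmetric per-vector integer quantization; returns (ints, scale)."""
    qmax = 2 ** (n_bits - 1) - 1          # e.g. 7 for 4 bits, 127 for 8 bits
    max_abs = max(abs(v) for v in vec)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in vec]
    return ints, scale

def dequantize(ints, scale):
    """Approximate reconstruction of the original values."""
    return [i * scale for i in ints]

kept = sink_plus_window_keep(seq_len=20, n_sink=2, window=3)
print(kept)  # [0, 1, 17, 18, 19]

key_vector = [0.8, -1.2, 0.05, 0.4]  # stand-in for one cached key
ints, scale = quantize(key_vector, n_bits=4)
restored = dequantize(ints, scale)
```

Eviction bounds the cache length regardless of context size, while quantization shrinks each stored element. The two can be combined, at the cost of some approximation error in attention.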

References

  • Tan — Crystallization of Transformer Architectures (GQA as consensus choice)
  • [13] Multi-Segment Attention (arXiv 2606.02964, 2026)
  • [15] Grouped Query Attention — The Sweet Spot Between Quality and Efficiency (2026)