KV Cache & Efficient Attention

When a transformer generates text autoregressively (one token at a time), a naive implementation recomputes the Keys and Values for every previous token at each step. The KV cache stores the projected K and V matrices from earlier steps, so each step computes K and V only for the current token. Removing that redundant work is what makes inference computationally feasible.
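A minimal sketch of the idea, assuming a single attention head and hypothetical 2×2 projection weights (`W_Q`, `W_K`, `W_V` and the toy token vectors are illustrative, not from any real model). It decodes the same sequence with and without a cache so the two outputs can be compared:

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

# Hypothetical projection weights, chosen only for illustration.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.4]]
W_V = [[1.0, 2.0], [0.0, 1.0]]

def decode_naive(tokens):
    """Re-project K and V for every earlier token at every step."""
    outputs = []
    for t in range(len(tokens)):
        keys = [matvec(W_K, x) for x in tokens[: t + 1]]
        values = [matvec(W_V, x) for x in tokens[: t + 1]]
        outputs.append(attend(matvec(W_Q, tokens[t]), keys, values))
    return outputs

def decode_cached(tokens):
    """Project K and V once per token and append them to the cache."""
    k_cache, v_cache, outputs = [], [], []
    for x in tokens:
        k_cache.append(matvec(W_K, x))
        v_cache.append(matvec(W_V, x))
        outputs.append(attend(matvec(W_Q, x), k_cache, v_cache))
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
naive_out = decode_naive(tokens)
cached_out = decode_cached(tokens)
```

Both functions return the same attention outputs. For a sequence of n tokens, the naive loop performs 1 + 2 + ... + n projections per matrix, while the cached loop performs n, at the cost of keeping every K and V vector in memory.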

The KV Cache Problem

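For a single sequence under standard multi-head attention, KV cache size grows as follows, where the factor of 2 accounts for storing both K and V: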

$2 \times \text{seq\_len} \times d_{\text{model}} \times n_{\text{layers}} \times \text{bytes\_per\_param}$

For a 70B model with a 200K-token context, the cache can exceed 100 GB, which can be larger than the model weights themselves [13]. This is the primary memory bottleneck in long-context inference.
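As a rough worked example of the formula, the sketch below plugs in assumed dimensions: 80 layers, a model width of 8192, and FP16 storage (2 bytes per value). These numbers are typical for a model of about 70B parameters but are assumptions, not figures from the cited source, and they describe full multi-head attention with no KV sharing.

```python
def kv_cache_bytes(seq_len, d_model, n_layers, bytes_per_param):
    """KV cache size for one sequence under full multi-head attention."""
    return 2 * seq_len * d_model * n_layers * bytes_per_param

# Assumed dimensions for a ~70B model; not taken from the source.
size = kv_cache_bytes(seq_len=200_000, d_model=8192, n_layers=80, bytes_per_param=2)
size_gb = size / 1e9

# FP16 weights for 70B parameters, for comparison.
weights_gb = 70e9 * 2 / 1e9

print(f"KV cache: {size_gb:.1f} GB")   # KV cache: 524.3 GB
print(f"Weights:  {weights_gb:.1f} GB")  # Weights:  140.0 GB
```

Under these assumptions, a single 200K-token sequence needs several times more memory for its cache than for the weights, which is why the attention variants discussed in this article target the KV dimension.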

Efficient Attention Variants

Multi-Head Attention (MHA) — Original

Each attention head has its own Q, K, and V projections. The cache stores K and V for all $n_{\text{heads}}$ heads, which is the full-size baseline given by the formula above.

Multi-Query Attention (MQA)

All query heads share a single K and V head. The KV cache shrinks by a factor of $n_{\text{heads}}$ with minimal quality loss. Used by Falcon and PaLM [15].

Grouped-Query Attention (GQA) — 2026 Standard

A compromise: query heads are divided into groups, and each group shares one set of K/V projections. For example, 8 query heads with 2 KV groups give a 4× cache reduction relative to MHA.

Why GQA won: Better quality than MQA, simpler than MLA, good hardware utilization. Used by Llama 3, Mistral, Gemma 3 [3, 13, 15].
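The grouping can be sketched as a simple index mapping. The function names below are hypothetical, and the sketch assumes that consecutive query heads share a KV head, which is one common layout rather than the only possible one. MHA and MQA fall out as the two extremes.

```python
def kv_head_for_query_head(query_head, n_query_heads, n_kv_heads):
    """Return the index of the KV head that a given query head reads from."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be a multiple of n_kv_heads")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

def cache_reduction_vs_mha(n_query_heads, n_kv_heads):
    """Factor by which the KV cache shrinks relative to full MHA."""
    return n_query_heads / n_kv_heads

# The example from the text: 8 query heads, 2 KV groups.
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
print(mapping)                       # [0, 0, 0, 0, 1, 1, 1, 1]
print(cache_reduction_vs_mha(8, 2))  # 4.0

# Extremes: MHA keeps one KV head per query head, MQA keeps one in total.
print(cache_reduction_vs_mha(8, 8))  # 1.0
print(cache_reduction_vs_mha(8, 1))  # 8.0
```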

Multi-head Latent Attention (MLA)

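DeepSeek-V2's innovation: compress K and V into a shared low-dimensional latent vector and cache only that vector. DeepSeek-V2 reports a KV-cache reduction of roughly 93% relative to its predecessor, DeepSeek 67B. The attention kernel is more complex, but the smaller cache enables very long context windows [13].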
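A toy sketch of the latent-compression idea, under loud assumptions: the dimensions (`D_MODEL = 8`, `D_LATENT = 2`) and the weight matrices are hypothetical values chosen for illustration, and the sketch omits the positional-encoding handling and other kernel details of a real MLA implementation. It only shows what gets cached (the latent vector) and what gets reconstructed on demand (K and V).

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

D_MODEL, D_LATENT = 8, 2  # hypothetical toy sizes

# Hypothetical deterministic weights; a real model learns these.
W_DOWN = [[0.1 * (i + j + 1) for j in range(D_MODEL)] for i in range(D_LATENT)]
W_UP_K = [[0.1 * (i - j) for j in range(D_LATENT)] for i in range(D_MODEL)]
W_UP_V = [[0.05 * (i + 2 * j) for j in range(D_LATENT)] for i in range(D_MODEL)]

latent_cache = []

def cache_token(hidden):
    """Down-project the hidden state and cache only the latent vector."""
    latent_cache.append(matvec(W_DOWN, hidden))

def keys_and_values():
    """Reconstruct full-width K and V from the cached latents."""
    keys = [matvec(W_UP_K, c) for c in latent_cache]
    values = [matvec(W_UP_V, c) for c in latent_cache]
    return keys, values

for step in range(3):
    cache_token([float(step + 1)] * D_MODEL)

keys, values = keys_and_values()

cached_floats = sum(len(c) for c in latent_cache)  # what this scheme stores
mha_floats = 2 * len(latent_cache) * D_MODEL       # full K and V for the same tokens
reduction = 1 - cached_floats / mha_floats
print(cached_floats, mha_floats, reduction)        # 6 48 0.875
```

The reduction here follows directly from the chosen toy sizes and says nothing about the savings of any real model.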

Other Optimization Techniques

KV cache quantization: Store the cache at FP8 or INT4 instead of FP16 (2-4× compression)
KV cache eviction: Selectively drop cache entries that contribute little to attention (H2O, StreamingLLM)
Ring attention / distributed cache: Shard the cache across multiple GPUs
Sparse attention patterns: Compute attention for only a subset of token pairs
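To make the first item concrete, here is a minimal sketch of INT4 cache quantization. It assumes symmetric quantization with one scale per vector and two 4-bit codes packed into each byte; production systems typically use finer-grained scales and fused kernels, so treat the function names and layout as hypothetical.

```python
def quantize_int4(vec):
    """Symmetric per-vector INT4 quantization. Returns (packed bytes, scale)."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Codes are clamped to [-7, 7] so that the range is symmetric.
    codes = [max(-7, min(7, round(x / scale))) for x in vec]
    if len(codes) % 2:
        codes.append(0)  # pad so that codes pair up into whole bytes
    packed = bytes(
        ((a & 0xF) << 4) | (b & 0xF) for a, b in zip(codes[0::2], codes[1::2])
    )
    return packed, scale

def dequantize_int4(packed, scale, length):
    """Unpack 4-bit two's-complement codes and rescale to floats."""
    def signed(nibble):
        return nibble - 16 if nibble >= 8 else nibble

    codes = []
    for byte in packed:
        codes.append(signed(byte >> 4))
        codes.append(signed(byte & 0xF))
    return [c * scale for c in codes[:length]]

# A hypothetical cached key vector.
key_vector = [0.7, -0.35, 0.1, 0.0, -0.7, 0.21, 0.5, -0.05]
packed, scale = quantize_int4(key_vector)
restored = dequantize_int4(packed, scale, len(key_vector))

fp16_bytes = 2 * len(key_vector)
print(len(packed), fp16_bytes)  # 4 16
```

The packed vector takes a quarter of the FP16 footprint (ignoring the small per-vector scale), and each restored value differs from the original by at most half a quantization step.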

References

Tan — Crystallization of Transformer Architectures (GQA as consensus choice)
[13] Multi-Segment Attention (arXiv 2606.02964, 2026)
[15] Grouped Query Attention — The Sweet Spot Between Quality and Efficiency (2026)
