Attention Mechanism

The attention mechanism is the core innovation of the Transformer. It allows each token in a sequence to compute a weighted combination of the representations of all tokens in the sequence (including its own), where the weights are determined by learned relevance scores. This replaces the sequential hidden-state propagation of RNNs with a fully parallelizable computation.

Scaled Dot-Product Attention

The fundamental operation. Given a sequence of N tokens, each represented as a d-dimensional embedding vector, we project them into three spaces:

  • Query (Q): “What am I looking for?”
  • Key (K): “What do I contain?”
  • Value (V): “What should I return?”

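The attention output is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

where $d_k$ is the dimension of the query and key vectors.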

Step by Step

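  1. Project each token’s embedding through learned matrices $W^Q$, $W^K$, $W^V$ to obtain $Q$, $K$, $V$
  2. Compute pairwise scores via the dot product of every Query with every Key, giving an $N \times N$ matrix
  3. Scale by $1/\sqrt{d_k}$, where $d_k$ is the query/key dimension, to prevent dot products from growing large (keeping variance ~1)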
  4. Softmax normalizes scores into a probability distribution over the sequence
  5. Weighted sum of Value vectors produces each token’s output
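
The five steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it uses plain Python lists instead of an array library, and the helper names (`matmul`, `softmax`, `scaled_dot_product_attention`) and the toy identity projections are assumptions made for the example.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(x, w_q, w_k, w_v):
    """x: N x d embeddings; w_q, w_k: d x d_k; w_v: d x d_v (illustrative shapes)."""
    # Step 1: project embeddings into queries, keys, and values.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Step 2: N x N matrix of query-key dot products.
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    # Step 3: scale by 1 / sqrt(d_k).
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    # Step 4: softmax each row into a distribution over the sequence.
    weights = [softmax(row) for row in scaled]
    # Step 5: weighted sum of value vectors.
    return matmul(weights, v), weights

# Toy input: 3 tokens, d = 2, identity projections (assumed for illustration).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = scaled_dot_product_attention(x, identity, identity, identity)
```

Each row of `weights` sums to 1, and row `i` describes how strongly token `i` attends to every token, itself included.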

The Database Analogy

Attention acts like a differentiable key-value store [6]:

  • Queries are search queries submitted to the database
  • Keys are the index labels on each stored item
  • Values are the items themselves
  • The output is a weighted combination of all values, weighted by query-key similarity
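
To make the analogy concrete, the sketch below contrasts a soft lookup with an ordinary exact-match lookup. The function `soft_lookup` and its `sharpness` parameter are hypothetical, introduced only for this example; `sharpness` multiplies the similarity scores so that a large value pushes the blend toward the single best-matching item.

```python
import math

def soft_lookup(query, keys, values, sharpness=1.0):
    """Return a similarity-weighted blend of values (hypothetical helper)."""
    # Query-key similarity via dot product, optionally sharpened.
    scores = [sharpness * sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    # Every stored value contributes, in proportion to its weight.
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]
query = [1.0, 0.0]  # matches the first key best

blended = soft_lookup(query, keys, values)                     # mixes both values
near_exact = soft_lookup(query, keys, values, sharpness=50.0)  # close to values[0]
```

Unlike a hash-table lookup, the result changes smoothly as the query or keys change, which is what lets gradients flow through the retrieval.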

Multi-Head Attention

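A single attention computation provides one “view.” Multi-Head Attention runs $h$ independent attention heads in parallel, each with distinct projection matrices:

$$\mathrm{head}_i = \mathrm{Attention}(X W_i^Q,\; X W_i^K,\; X W_i^V), \quad i = 1, \dots, h$$

where $X$ is the matrix of input token embeddings.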

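Outputs are concatenated and projected:

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$$

where $W^O$ is a learned output projection matrix.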

Different heads tend to specialize in different linguistic phenomena, such as syntax, semantics, or coreference. Many large models use between 32 and 128 heads.
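
The per-head projections and the final concatenation can be sketched as follows. This is an illustrative sketch in plain Python, assuming each head is given as a `(w_q, w_k, w_v)` tuple; the function names, the two toy heads, and the identity output projection are assumptions for the example, not part of any particular library.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(x, heads, w_o):
    """heads: list of (w_q, w_k, w_v), one tuple per head; w_o: (h * d_v) x d_model."""
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate along the feature axis: row t holds every head's output for token t.
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(x))]
    return matmul(concat, w_o)

# Toy setup (assumed): 3 tokens, d_model = 4, h = 2 heads with d_k = d_v = 2.
x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]  # head 0 sees dims 0-1
last_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # head 1 sees dims 2-3
heads = [(first_half, first_half, first_half), (last_half, last_half, last_half)]
w_o = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

out = multi_head_attention(x, heads, w_o)  # 3 x 4 output
```

Because the two toy heads read disjoint halves of the embedding, they produce different attention patterns over the same tokens, which is the point of using several heads.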

Types of Attention

  • Self-attention: Q, K, V all from the same sequence (encoder blocks)
  • Causal/masked self-attention: Future tokens are masked out (decoder blocks)
  • Cross-attention: Q from decoder, K/V from encoder (encoder-decoder models)
  • Grouped-Query Attention (GQA): Groups of query heads share K/V heads, which shrinks the KV cache
  • Multi-Query Attention (MQA): All query heads share a single K/V head
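
Two of the variants above are easy to show in code: causal masking, and the way query heads are assigned to shared K/V heads. The sketch below is illustrative only. Both function names are hypothetical, dropping future positions before the softmax is used as an equivalent of setting their scores to negative infinity, and the contiguous grouping of query heads is an assumption about layout rather than a requirement of GQA.

```python
import math

def causal_attention_weights(scores):
    """Softmax each row of an N x N score matrix over positions <= the row index."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # future positions j > i are excluded
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Map a query head to the K/V head it shares (contiguous groups assumed)."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

uniform_scores = [[0.0, 0.0, 0.0] for _ in range(3)]
causal = causal_attention_weights(uniform_scores)

gqa = [kv_head_for_query_head(h, 8, 2) for h in range(8)]  # [0, 0, 0, 0, 1, 1, 1, 1]
mqa = [kv_head_for_query_head(h, 8, 1) for h in range(8)]  # every head maps to 0
mha = [kv_head_for_query_head(h, 8, 8) for h in range(8)]  # one K/V head per query head
```

With uniform scores, the first token attends only to itself, the second splits its weight over two positions, and so on. Setting `num_kv_heads` to 1 or to `num_query_heads` recovers MQA and standard multi-head attention as the two extremes of GQA.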

Limitations

  1. O(N²) complexity: The $N \times N$ attention matrix means compute and memory grow quadratically with sequence length, making it the primary bottleneck for long-context processing [5, 7]
  2. Softmax saturation: “Winner-take-all” dynamics cause vanishing gradients for non-dominant positions in deep stacks [9]
  3. Topological blind spots: Self-attention operates on flat sequences, struggling with hierarchical/tree-structured reasoning [10]
  4. Uncertainty quantification: Poorly calibrated probabilities under distribution shift [7]
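
The first two limitations can be illustrated with simple arithmetic. The sketch below is a rough illustration under stated assumptions: `attention_matrix_bytes` counts only one head's score matrix with 2-byte scores, and `softmax_self_gradient` uses the standard identity that the derivative of a softmax output $p_i$ with respect to its own logit is $p_i (1 - p_i)$. Both helper names are hypothetical.

```python
import math

def attention_matrix_bytes(seq_len, bytes_per_score=2):
    """Bytes for one head's seq_len x seq_len score matrix (2 bytes = 16-bit floats)."""
    return seq_len * seq_len * bytes_per_score

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_self_gradient(probs):
    """d p_i / d s_i = p_i * (1 - p_i) for each softmax output p_i."""
    return [p * (1.0 - p) for p in probs]

# Quadratic growth: doubling the sequence length quadruples the score matrix.
growth = attention_matrix_bytes(2048) / attention_matrix_bytes(1024)

# Saturation: one dominant score drives every self-gradient toward zero.
flat_grads = softmax_self_gradient(softmax([0.0, 0.0, 0.0]))
saturated_grads = softmax_self_gradient(softmax([10.0, 0.0, 0.0]))
```

In the saturated case the gradients are small for the non-dominant positions and for the dominant one as well, since its probability is already close to 1.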

References

  • Vaswani et al. — Attention Is All You Need (2017)
  • Serret — Understanding Transformers and Attention Mechanisms (arXiv 2604.00965, 2026)
  • Mondal & Jagtap — In Transformer We Trust? (arXiv 2602.14318, 2026)
  • Limitations of Normalization in Attention (OpenReview, 2025)
  • [10] The Topological Trouble With Transformers (arXiv 2604.17121, 2026)