Attention Mechanism
The attention mechanism is the core innovation of the Transformer. It allows each token in a sequence to compute a weighted combination of the representations of all tokens in the sequence (itself included), where the weights are determined by learned relevance scores. This replaces the sequential hidden-state propagation of RNNs with a fully parallelizable computation.
Scaled Dot-Product Attention
Scaled dot-product attention is the fundamental operation. Given a sequence of N tokens, each represented as a d-dimensional embedding vector, we project them into three spaces:
- Query (Q): “What am I looking for?”
- Key (K): “What do I contain?”
- Value (V): “What should I return?”
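The attention output is:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
where $d_k$ is the dimension of the key vectors.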
Step by Step
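- Project each token’s embedding through learned matrices $W^Q$, $W^K$, $W^V$ to obtain $Q$, $K$, $V$
- Compute pairwise scores via the dot product of every Query with every Key, giving an $N \times N$ matrix
- Scale by $1/\sqrt{d_k}$ to prevent dot products from growing large (keeping the score variance near 1 when query and key components have unit variance)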
- Softmax normalizes scores into a probability distribution over the sequence
- Weighted sum of Value vectors produces each token’s output
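The five steps above can be sketched in a few lines of Python. This is a minimal, dependency-free illustration rather than a reference implementation: it uses plain lists instead of an array library, and the helper names (`matmul`, `transpose`, `softmax`), the toy embeddings `X`, and the identity projection matrices are all assumptions chosen for readability.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    # Step 1: project the embeddings into query, key, and value spaces.
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    # Step 2: pairwise query-key dot products, an N x N matrix.
    scores = matmul(Q, transpose(K))
    # Step 3: scale by 1 / sqrt(d_k).
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    # Step 4: softmax over each row, so every token gets a distribution.
    weights = [softmax(row) for row in scaled]
    # Step 5: weighted sum of the value vectors.
    return matmul(weights, V), weights

# Toy input (hypothetical): 3 tokens with 2-dimensional embeddings.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections
output, weights = scaled_dot_product_attention(X, identity, identity, identity)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and `output` has one row per token. In practice the projection matrices are learned during training and the computation runs on batched tensors.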
The Database Analogy
Attention acts like a differentiable key-value store [6]:
- Queries are search queries submitted to the database
- Keys are the index labels on each stored item
- Values are the items themselves
- The output is a weighted combination of all values, weighted by query-key similarity
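To make the analogy concrete, the sketch below treats attention as a "soft" lookup over three stored items. The keys, values, and queries are made-up toy data, and the function name `soft_lookup` is hypothetical; projections and the scaling factor are omitted to keep the focus on the lookup behaviour.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values):
    """Return a similarity-weighted blend of all values, plus the weights."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    blended = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return blended, weights

# Toy "database": three one-hot keys, each indexing a 1-dimensional value.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[10.0], [20.0], [30.0]]

# A weak match on the first key blends all three values.
soft_out, soft_w = soft_lookup([1.0, 0.0, 0.0], keys, values)

# A much stronger match behaves almost like an exact dictionary lookup.
sharp_out, sharp_w = soft_lookup([20.0, 0.0, 0.0], keys, values)

print(soft_out, sharp_out)
```

Unlike a hard lookup, every stored value contributes to the result, which is what keeps the operation differentiable. As the query-key match becomes more pronounced, the output approaches the single best-matching value.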
Multi-Head Attention
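A single attention computation provides one “view.” Multi-Head Attention runs $h$ independent attention heads in parallel, each with distinct projection matrices. For self-attention over the $N \times d$ embedding matrix $X$:
$$\mathrm{head}_i = \mathrm{Attention}(X W_i^Q,\; X W_i^K,\; X W_i^V), \quad i = 1, \dots, h$$
Outputs are concatenated and projected:
$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$$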
Each head can specialize in different linguistic phenomena, such as syntax, semantics, or coreference. Large modern models typically use 32-128 heads.
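The following is a minimal sketch of multi-head self-attention under stated assumptions: plain Python lists, randomly initialised weights in place of learned ones, and toy sizes (4 tokens, model width 8, 2 heads) that are far smaller than anything used in practice. All function and variable names are illustrative.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for already-projected Q, K, V."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_t)]
    return matmul([softmax(row) for row in scores], V)

def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(rows) for _ in range(cols)]
            for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one per head."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the per-head outputs along the feature dimension.
    concat = [[x for head in head_outputs for x in head[i]] for i in range(len(X))]
    # Final output projection mixes information across heads.
    return matmul(concat, W_o)

rng = random.Random(0)
n_tokens, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads  # each head works in a smaller subspace

X = random_matrix(n_tokens, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
W_o = random_matrix(n_heads * d_head, d_model, rng)

Y = multi_head_attention(X, heads, W_o)
print(len(Y), len(Y[0]))
```

Setting the per-head width to `d_model // n_heads` is a common design choice: it keeps the total cost close to that of a single full-width head while letting each head attend with its own projections.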
Types of Attention
- Self-attention: Q, K, V all from the same sequence (encoder blocks)
- Causal/masked self-attention: Future tokens are masked out (decoder blocks)
- Cross-attention: Q from decoder, K/V from encoder (encoder-decoder models)
- Grouped-Query Attention (GQA): Multiple query heads share each K/V head, which shrinks the KV cache
- Multi-Query Attention (MQA): All query heads share a single K/V head
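Two of the variants above can be illustrated briefly. The sketch below shows how a causal mask changes the attention weights, and how query heads might be mapped onto shared K/V heads in GQA and MQA. The function names are hypothetical, and the contiguous grouping of query heads is an assumption (one common convention, not the only possible one).

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Mask out future positions (j > i) before the softmax."""
    n = len(scores)
    masked = [
        [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        for i in range(n)
    ]
    return [softmax(row) for row in masked]

def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the K/V head used by a query head (contiguous groups assumed)."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

# With uniform scores, each token spreads its weight over itself and the past only.
uniform_scores = [[0.0] * 3 for _ in range(3)]
causal_weights = causal_attention_weights(uniform_scores)
for row in causal_weights:
    print([round(w, 3) for w in row])

# 8 query heads sharing 2 K/V heads (GQA), 1 K/V head (MQA), or 8 (standard multi-head).
gqa = [kv_head_for(h, 8, 2) for h in range(8)]
mqa = [kv_head_for(h, 8, 1) for h in range(8)]
mha = [kv_head_for(h, 8, 8) for h in range(8)]
print(gqa, mqa, mha)
```

Under this mapping, MQA and standard multi-head attention are the two extremes of GQA: one shared K/V head versus one K/V head per query head.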
Limitations
- $O(N^2)$ complexity: The $N \times N$ attention matrix means compute and memory grow quadratically with sequence length, which is the primary bottleneck for long-context processing [5, 7]
- Softmax saturation: “Winner-take-all” dynamics cause vanishing gradients for non-dominant positions in deep stacks [9]
- Topological blind spots: Self-attention operates on flat sequences, struggling with hierarchical/tree-structured reasoning [10]
- Uncertainty quantification: Poorly calibrated probabilities under distribution shift [7]
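The quadratic growth in the first limitation is easy to quantify with back-of-the-envelope arithmetic. The sketch below assumes the score matrix is materialised in full, for a single layer and a batch size of 1, with 32 heads and 2 bytes per score (16-bit floats). All of these figures are illustrative assumptions, and the function name is hypothetical.

```python
def attention_matrix_bytes(seq_len, n_heads=1, bytes_per_score=2):
    """Bytes needed to hold the full seq_len x seq_len score matrix for every head."""
    return n_heads * seq_len * seq_len * bytes_per_score

for seq_len in (1_024, 4_096, 16_384):
    mib = attention_matrix_bytes(seq_len, n_heads=32) / 2**20
    print(f"{seq_len:>6} tokens -> {mib:,.0f} MiB")
```

Multiplying the sequence length by 4 multiplies the memory for the score matrix by 16, which is why long contexts become expensive so quickly.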
References
- Vaswani et al. — Attention Is All You Need (2017)
- Serret — Understanding Transformers and Attention Mechanisms (arXiv 2604.00965, 2026)
- Mondal & Jagtap — In Transformer We Trust? (arXiv 2602.14318, 2026)
- Limitations of Normalization in Attention (OpenReview, 2025)
- [10] The Topological Trouble With Transformers (arXiv 2604.17121, 2026)