Attention Mechanism

The attention mechanism is the core innovation of the Transformer. It allows each token in a sequence to compute a weighted combination of the representations of all tokens in the sequence (including its own), where the weights come from learned relevance scores. This replaces the sequential hidden-state propagation of RNNs with a fully parallelizable computation.

Scaled Dot-Product Attention

Scaled dot-product attention is the fundamental operation. Given a sequence of N tokens, each represented as a d-dimensional embedding vector, we project every token into three spaces:

Query (Q): “What am I looking for?”
Key (K): “What do I contain?”
Value (V): “What should I return?”

The attention output is:

$Attention (Q, K, V) = softmax (\frac{Q K ^{T}}{\sqrt{d _{k}}}) V$

Step by Step

Project each token’s embedding through learned matrices $W_{Q}$ , $W_{K}$ , $W_{V}$
Compute pairwise scores via dot product of every Query with every Key — an $N \times N$ matrix
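Scale by $\frac{1}{\sqrt{d _{k}}}$ to keep dot products from growing with the key dimension: if query and key components are independent with zero mean and unit variance, their dot product has variance $d_{k}$, and dividing by $\sqrt{d_{k}}$ brings it back to ~1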
Softmax normalizes scores into a probability distribution over the sequence
Weighted sum of Value vectors produces each token’s output
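
The five steps above can be sketched in plain Python. This is a minimal illustration, not an optimized implementation: matrices are lists of rows, the helper names (`matmul`, `transpose`, `softmax`) are hypothetical, and the identity projection matrices in the toy input are an assumption chosen so the scores are easy to check by hand. Real systems would use an array library.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(x, w_q, w_k, w_v):
    # Step 1: project embeddings into query, key, and value spaces.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Step 2: pairwise query-key scores, an N x N matrix.
    scores = matmul(q, transpose(k))
    # Step 3: scale by 1 / sqrt(d_k).
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    # Step 4: softmax over each row (one distribution per query).
    weights = [softmax(row) for row in scaled]
    # Step 5: weighted sum of value vectors.
    return matmul(weights, v), weights

# Toy input: N = 3 tokens, d = 2. Identity projections are an assumption.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = scaled_dot_product_attention(x, identity, identity, identity)
```

Each row of `weights` sums to 1. In this toy input the first token attends equally to itself and the third token, because both share its nonzero component, and less to the second token.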

The Database Analogy

Attention acts like a differentiable key-value store [6]:

Queries are search queries submitted to the database
Keys are the index labels on each stored item
Values are the items themselves
The output is a weighted combination of all values, weighted by query-key similarity
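
To make the analogy concrete, the sketch below contrasts a soft lookup with an ordinary exact-match lookup. The function `soft_lookup` and its `sharpness` parameter are hypothetical names introduced only for illustration; `sharpness` simply multiplies the similarity scores to show how a very peaked softmax approaches a hard lookup.

```python
import math

def soft_lookup(query, keys, values, sharpness=1.0):
    """Return a similarity-weighted blend of all values (illustrative helper)."""
    # Query-key similarity via dot product, optionally sharpened.
    scores = [sharpness * sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax turns similarities into weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Blend every stored value by its weight.
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 20.0]]
query = [1.0, 0.0]

blended = soft_lookup(query, keys, values)                      # mixes both values
nearly_hard = soft_lookup(query, keys, values, sharpness=50.0)  # close to values[0]
```

Unlike a dictionary, the soft version never fails on a missing key and is differentiable with respect to the query, keys, and values, which is what lets the projections be learned by gradient descent.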

Multi-Head Attention

A single attention computation provides one “view.” Multi-Head Attention runs $h$ independent attention heads in parallel, each with distinct projection matrices:

$head_{i} = Attention (X W_{Q}^{(i)}, X W_{K}^{(i)}, X W_{V}^{(i)})$

Outputs are concatenated and projected: $MHA (X) = Concat (head_{1}, ..., head_{h}) W_{O}$

Different heads can specialize in different linguistic phenomena, such as syntax, semantics, or coreference. The base model in Vaswani et al. used 8 heads; large modern models commonly use 32-128.
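
The head and concatenation formulas above can be sketched as follows. This is an illustrative example under stated assumptions: weights are random rather than learned, the per-head width is set to `d_model // n_heads` (a common choice, not something the formulas require), and all helper names are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    """Stand-in for learned weights (assumption: standard normal entries)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, heads, w_o):
    # Each head applies its own (W_Q, W_K, W_V) projections to the same input X.
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate head outputs along the feature axis: N x (h * d_head).
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    # Final output projection W_O maps back to d_model.
    return matmul(concat, w_o)

rng = random.Random(0)
n_tokens, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
x = random_matrix(n_tokens, d_model, rng)
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(n_heads)
]
w_o = random_matrix(n_heads * d_head, d_model, rng)
y = multi_head_attention(x, heads, w_o)  # shape: n_tokens x d_model
```

Because every head works in a narrower `d_head`-dimensional space, the total cost in this configuration stays close to that of a single head of width `d_model`.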

Types of Attention

Self-attention: Q, K, V all from the same sequence (encoder blocks)
Causal/masked self-attention: Future tokens are masked out (decoder blocks)
Cross-attention: Q from decoder, K/V from encoder (encoder-decoder models)
Grouped-Query Attention (GQA): Multiple query heads share K/V heads — kv-cache optimization
Multi-Query Attention (MQA): All query heads share a single K/V head
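
Two of these variants are easy to illustrate in a few lines. The sketch below shows causal masking as a row-wise softmax that ignores future positions, and K/V sharing as a mapping from query heads to K/V heads. Both function names are hypothetical, and the contiguous grouping of query heads is an assumption about layout, not a requirement of GQA.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Future positions get -inf so they receive exactly zero weight.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        peak = max(masked)  # finite, because position i itself is never masked
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Map a query head to the K/V head it shares (contiguous groups assumed)."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

# Uniform scores make the effect of the mask easy to see.
weights = causal_softmax([[0.0, 0.0, 0.0] for _ in range(3)])

gqa = [kv_head_for(h, 8, 2) for h in range(8)]  # 8 query heads, 2 K/V heads
mqa = [kv_head_for(h, 8, 1) for h in range(8)]  # 8 query heads, 1 K/V head
```

With uniform scores, the first row of `weights` puts all mass on position 0, the second splits it evenly over positions 0 and 1, and the third over all three. With fewer K/V heads, fewer key and value tensors need to be cached per token, which is the source of the kv-cache saving.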

Limitations

O(N²) complexity: The $N \times N$ attention matrix means compute and memory grow quadratically with sequence length, the primary bottleneck for long-context processing [5, 7]
Softmax saturation: “Winner-take-all” dynamics cause vanishing gradients for non-dominant positions in deep stacks [9]
Topological blind spots: Self-attention operates on flat sequences, struggling with hierarchical/tree-structured reasoning [10]
Uncertainty quantification: Poorly calibrated probabilities under distribution shift [7]
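
The quadratic growth in the first limitation can be made tangible with a little arithmetic. The sketch below assumes the full $N \times N$ score matrix is materialized for a single head and a single sequence, with 2 bytes per score (16-bit floats); the function name and these defaults are illustrative assumptions.

```python
def attention_matrix_bytes(seq_len, bytes_per_score=2):
    """Memory for one N x N score matrix (one head, one sequence)."""
    return seq_len * seq_len * bytes_per_score

for n in (1_024, 8_192, 65_536):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"N = {n:>6}: {mib:,.0f} MiB")
```

Under these assumptions the matrix takes 2 MiB at N = 1,024, 128 MiB at N = 8,192, and 8,192 MiB at N = 65,536: every doubling of sequence length quadruples the footprint.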

References

Vaswani et al. — Attention Is All You Need (2017)
Serret — Understanding Transformers and Attention Mechanisms (arXiv 2604.00965, 2026)
Mondal & Jagtap — In Transformer We Trust? (arXiv 2602.14318, 2026)
Limitations of Normalization in Attention (OpenReview, 2025)
[10] The Topological Trouble With Transformers (arXiv 2604.17121, 2026)
