Positional Encoding
Self-attention (see attention-mechanism) is permutation-equivariant: reordering the input tokens only reorders the outputs in the same way, so the layer by itself cannot tell one token order from another. The model has no inherent notion of sequence position, so positional information must be injected separately.
Sinusoidal Positional Encoding (2017)
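The original Transformer paper (Vaswani et al., 2017) used fixed sinusoidal functions of the position pos and the dimension-pair index i:
$$PE_{(pos,\,2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$$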
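Advantages: no learned parameters, and the functions are defined for every position, so in principle the encoding can be evaluated on sequences longer than those seen in training (in practice, models trained this way tend to extrapolate poorly). However, the absolute position signal proved suboptimal for many tasks.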
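The sinusoidal scheme can be sketched in a few lines. This is a minimal illustration in plain Python, not a reference implementation: the function name `sinusoidal_encoding` and its parameters are assumptions made for this example, and a real model would build the table with an array library.

```python
import math


def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Return a num_positions x d_model table of fixed sinusoidal encodings.

    Hypothetical helper for illustration; base=10000 follows the original paper.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model // 2):
            # The frequency falls geometrically as the pair index i grows.
            angle = pos / (base ** (2 * i / d_model))
            row.append(math.sin(angle))  # even dimension 2i
            row.append(math.cos(angle))  # odd dimension 2i + 1
        table.append(row)
    return table


# Each row is added to the token embedding at that position.
pe = sinusoidal_encoding(num_positions=4, d_model=8)
```

Because the table depends only on `pos` and `d_model`, it can be computed for any sequence length without training.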
Rotary Position Embedding (RoPE) — 2026 Standard
RoPE has become the de facto standard, adopted by Llama, Mistral, Gemma, and most modern models. It encodes position by rotating the Query and Key vectors by an angle proportional to token position:
- Each pair of dimensions is treated as a 2D vector and rotated
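- The rotation angle for pair i at position m is m·θ_i, with θ_i = base^(-2i/d) (base is commonly 10000), so the angle grows linearly with position and each pair rotates at its own frequency
- Attention scores tend to decay as the distance between tokens grows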
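The rotation described above can be sketched as follows. This is a minimal example under stated assumptions: the helper names `rope` and `dot` are hypothetical, adjacent dimensions are paired (some implementations pair the first half of the vector with the second half instead), and the base of 10000 is the commonly used default.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by a position-dependent angle.

    Hypothetical helper: pairs (0, 1), (2, 3), ... are an assumed convention.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)   # per-pair rotation frequency
        angle = position * theta       # angle grows linearly with position
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * c - y * s)      # standard 2D rotation
        out.append(x * s + y * c)
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same offset (3) at different absolute positions: the two scores agree
# up to floating-point error, because only the relative rotation survives.
score_near_start = dot(rope(q, 5), rope(k, 2))
score_far_out = dot(rope(q, 105), rope(k, 102))
```

Rotation leaves vector norms unchanged, so only the angle between query and key carries the position signal.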
Why RoPE won:
- Provides theoretically grounded relative position information (the query-key dot product depends on the offset between tokens, not on their absolute positions)
- Compatible with linear attention and kv-cache optimization
- Enables context extension techniques (YaRN, NTK-aware scaling)
- Works out-of-the-box with Grouped-Query Attention
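As an illustration of how context extension can work, here is a sketch of NTK-aware scaling as it is commonly described: the RoPE base is enlarged so that low frequencies are stretched while high frequencies are left nearly untouched. The function names are hypothetical, and the exponent d / (d - 2) is the commonly cited form rather than anything stated in this article. YaRN builds further refinements on this idea and is not shown.

```python
def rope_frequencies(d, base=10000.0):
    """Per-pair rotation frequencies theta_i = base^(-2i/d)."""
    return [base ** (-2 * i / d) for i in range(d // 2)]


def ntk_scaled_base(d, scale, base=10000.0):
    """Enlarged base for a context `scale` times longer (assumed formula)."""
    return base * scale ** (d / (d - 2))


d, scale = 64, 4.0
original = rope_frequencies(d)
extended = rope_frequencies(d, base=ntk_scaled_base(d, scale))

# The highest frequency (i = 0) is unchanged, so nearby tokens are still
# resolved as before; the lowest frequency is divided by `scale`, so the
# longest wavelength now spans the extended context.
highest_ratio = extended[0] / original[0]
lowest_ratio = extended[-1] / original[-1]
```

With these values, `highest_ratio` is 1 and `lowest_ratio` is 1 / scale up to rounding, with the pairs in between interpolating smoothly.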
Other Approaches
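- ALiBi (Press et al., 2022): Adds a per-head linear penalty to the attention logits, proportional to the query-key distance; simpler but less expressive
- Learned absolute embeddings: GPT-2/3 used learned position embeddings, which cannot extrapolate beyond the maximum position seen in training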
- RoPE + NoPE mixtures: Some models combine RoPE with no-position encodings for certain layers
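For comparison with RoPE, the ALiBi bias can be sketched as below. This is a minimal example under assumptions: the function names are hypothetical, the slope schedule shown is the geometric sequence used when the head count is a power of two, and the causal mask is folded into the bias matrix for brevity.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: 2^(-8/n), 2^(-16/n), ... (assumes n is a power of two)."""
    start = 2 ** (-8 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Bias added to attention logits for one head.

    Entry [i][j] is -slope * (i - j) for keys at or before the query,
    and -inf for future keys (causal masking, folded in here for brevity).
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
bias = alibi_bias(seq_len=4, slope=slopes[0])
```

No position information is added to the embeddings at all; the penalty on distant keys is the only position signal.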
References
- Vaswani et al. — Attention Is All You Need (sinusoidal encoding)
- Jun Yu Tan — Crystallization of Transformer Architectures (RoPE adoption trajectory)