Layer Normalization & RMSNorm
Normalization layers are critical for stable training of deep transformer networks. The 2026 consensus uses pre-norm RMSNorm, a departure from the original post-norm LayerNorm.
Layer Normalization (Original)
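Given an input vector $x \in \mathbb{R}^d$, LayerNorm normalizes to mean 0 and variance 1, then applies learned affine parameters $\gamma, \beta \in \mathbb{R}^d$:

$$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Where $\mu = \frac{1}{d}\sum_{i=1}^{d} x_i$ and $\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2$.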
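The normalization described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `layer_norm`, the parameter names `gamma` and `beta` (gain and bias), and the default `eps` are assumptions chosen for this example.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize x to zero mean and unit variance, then apply gain and bias."""
    d = len(x)
    mu = sum(x) / d                               # mean over the feature dimension
    var = sum((xi - mu) ** 2 for xi in x) / d     # biased variance over features
    inv_std = 1.0 / math.sqrt(var + eps)          # eps guards against division by zero
    return [g * (xi - mu) * inv_std + b for xi, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)

# With unit gain and zero bias, the output has mean ~0 and variance ~1.
mean_y = sum(y) / len(y)
var_y = sum((v - mean_y) ** 2 for v in y) / len(y)
print(mean_y, var_y)
```

With unit gain and zero bias the output statistics are fixed regardless of the input's scale or offset; the learned parameters then let the network choose a different scale and shift per feature.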
RMSNorm (2026 Standard)
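RMSNorm simplifies this by removing the mean-centering step. For an input $x \in \mathbb{R}^d$ and a learned gain $\gamma \in \mathbb{R}^d$, it only rescales by the root-mean-square:

$$\mathrm{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}$$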
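As a rough sketch of the rescaling just described, the following plain-Python function divides each element by the root-mean-square of the vector. The name `rms_norm`, the gain parameter `gamma`, and the default `eps` are assumptions made for this example.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """Rescale x by its root-mean-square; no mean subtraction, no bias."""
    d = len(x)
    mean_sq = sum(xi * xi for xi in x) / d        # mean of squares over features
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)      # eps guards against division by zero
    return [g * xi * inv_rms for xi, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 4.0]
y = rms_norm(x, gamma=[1.0] * 4)

# The output has RMS ~1, but its mean is NOT forced to zero.
rms_y = math.sqrt(sum(v * v for v in y) / len(y))
mean_y = sum(y) / len(y)
print(rms_y, mean_y)
```

Because the mean is left in place, the result differs from a mean-centered normalization whenever the input has a nonzero mean. For a zero-mean input, the mean of squares equals the variance, so the two rescalings coincide.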
Why RMSNorm won [3]:
- ~40% faster than LayerNorm (it skips mean subtraction, so only a single sum of squares is computed per vector)
- Marginal quality difference at scale
- Simpler gradient flow
Pre-Norm vs Post-Norm
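| Placement | Definition | Behavior at depth |
|---|---|---|
| Post-norm (original, 2017) | Normalize after the residual add: $y = \mathrm{Norm}(x + \mathrm{Sublayer}(x))$ | Unstable at depth: gradients struggle through the norm |
| Pre-norm (modern, 2026) | Normalize before the sublayer: $y = x + \mathrm{Sublayer}(\mathrm{Norm}(x))$ | Stable at any depth; residual stream preserved |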
The switch to pre-norm was one of the earliest and most impactful changes to the original architecture, enabling training of 100+ layer models [10].
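The difference between the two placements is easiest to see side by side. The sketch below is illustrative only: `toy_sublayer` is a hypothetical stand-in for attention or an MLP, the norm is a gain-free RMS rescaling, and the example shows only the forward structure, not training stability itself.

```python
import math

def rms_norm(x, eps=1e-6):
    """Gain-free RMS rescaling, used here as the norm for both placements."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]

def toy_sublayer(x):
    # Hypothetical stand-in for an attention or MLP sublayer.
    return [0.5 * v + 0.1 for v in x]

def post_norm_block(x, sublayer, norm):
    # Post-norm: the residual sum itself passes through the norm.
    return norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer, norm):
    # Pre-norm: only the sublayer input is normalized; x is added back untouched.
    return [a + b for a, b in zip(x, sublayer(norm(x)))]

def rms(v):
    return math.sqrt(sum(a * a for a in v) / len(v))

x = [10.0, -20.0, 30.0, 5.0]
h_post, h_pre = x, x
for _ in range(8):  # stack eight blocks of each kind
    h_post = post_norm_block(h_post, toy_sublayer, rms_norm)
    h_pre = pre_norm_block(h_pre, toy_sublayer, rms_norm)

print(rms(h_post), rms(h_pre))
```

In the post-norm stack every block rescales the whole stream, so the original input's magnitude is gone after the first block. In the pre-norm stack the input is carried forward unchanged on the skip path and each sublayer only adds to it, which is the "residual stream preserved" property noted in the table.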
QK-Normalization
QK-normalization applies an additional normalization to the query and key projections before the dot product in attention. This prevents attention logits from growing excessively large at scale. It is used in ViT-22B, OLMo 2, Gemma 3, Qwen3, and many other 2024+ models [3].
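A small sketch shows how normalizing queries and keys bounds the attention logit. It assumes a gain-free RMS rescaling for both vectors and a single query-key pair; real implementations typically normalize per attention head and include a learned gain, which is omitted here. The function names are chosen for this example.

```python
import math

def rms_norm(x, eps=1e-6):
    """Gain-free RMS rescaling (learned gain omitted for brevity)."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]

def attention_logit(q, k, qk_norm=False):
    """Scaled dot-product logit for one query/key pair."""
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)   # normalize BEFORE the dot product
    dot = sum(a * b for a, b in zip(q, k))
    return dot / math.sqrt(len(q))

# Large-magnitude projections, as might arise late in training at scale.
q = [50.0, -80.0, 120.0, 40.0]
k = [60.0, -70.0, 100.0, 30.0]

raw = attention_logit(q, k)
normed = attention_logit(q, k, qk_norm=True)
print(raw, normed)
```

After RMS rescaling each vector has Euclidean norm at most $\sqrt{d}$, so by Cauchy-Schwarz the scaled logit is bounded in magnitude by $\sqrt{d}$ no matter how large the raw projections grow. With a learned gain, the bound scales with the gain instead of being fixed.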
References
- Tan — Crystallization of Transformer Architectures
- [10] Limitations of Normalization in Attention (OpenReview, 2025)