Layer Normalization & RMSNorm

Normalization layers are critical for stable training of deep transformer networks. As of 2026, the consensus choice is RMSNorm in the pre-norm position, a departure from the post-norm LayerNorm of the original 2017 Transformer.

Layer Normalization (Original)

Given an input vector $x \in \mathbb{R}^{d}$, LayerNorm normalizes it to mean 0 and variance 1 across the $d$ features, then applies the learned affine parameters $\gamma$ (gain) and $\beta$ (bias):

$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}}, \quad y = \gamma \odot \hat{x} + \beta$

where $\mu = \frac{1}{d} \sum_{i=1}^{d} x_{i}$ and $\sigma^{2} = \frac{1}{d} \sum_{i=1}^{d} (x_{i} - \mu)^{2}$.
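
As a minimal sketch of the definition above, the following plain-Python function normalizes a single feature vector by its mean and standard deviation, then applies the gain and bias. The function name `layer_norm`, the default `eps`, and the sample values are illustrative assumptions, not taken from any particular library.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over one feature vector x (a list of floats)."""
    d = len(x)
    mu = sum(x) / d                              # mean over the d features
    var = sum((xi - mu) ** 2 for xi in x) / d    # biased variance, as in the formula
    inv_std = 1.0 / math.sqrt(var + eps)         # eps guards against division by zero
    x_hat = [(xi - mu) * inv_std for xi in x]    # zero mean, (almost) unit variance
    return [g * xh + b for g, xh, b in zip(gamma, x_hat, beta)]

# Hypothetical input; gamma = 1 and beta = 0 leave the normalized vector unchanged.
x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
```

With unit gain and zero bias, `y` has mean 0 and variance just under 1; the small shortfall comes from `eps` in the denominator.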

RMSNorm (2026 Standard)

RMSNorm simplifies LayerNorm by dropping the mean-centering step and the bias $\beta$. It only rescales $x$ by its root-mean-square:

$\hat{x} = \frac{x}{\sqrt{\frac{1}{d} \sum_{i=1}^{d} x_{i}^{2} + \epsilon}}, \quad y = \gamma \odot \hat{x}$
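
A minimal plain-Python sketch of this rescaling follows. The name `rms_norm`, the default `eps`, and the sample vector are assumptions made for illustration.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm over one feature vector x: rescale only, no centering, no bias."""
    d = len(x)
    mean_sq = sum(xi * xi for xi in x) / d       # mean of squares, a single reduction
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)     # reciprocal root-mean-square
    return [g * xi * inv_rms for g, xi in zip(gamma, x)]

# Hypothetical input with a clearly non-zero mean.
x = [1.0, 2.0, 3.0, 4.0]
y = rms_norm(x, gamma=[1.0] * 4)
```

The output has a mean square close to 1, but its mean is not zero: RMSNorm leaves any offset in the input in place and only changes the scale.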

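Why RMSNorm won [3]:

Reported as roughly 40% faster than LayerNorm, since no mean is computed or subtracted and a single mean-of-squares reduction replaces the separate mean and variance computations
Only a marginal quality difference at scale
Simpler gradient flow, with no mean-subtraction term in the backward pass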
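
To make the relationship between the two layers concrete, the sketch below compares them with unit gain and zero bias. When the input already has zero mean, the variance equals the mean of squares, so the two outputs coincide; once the input is shifted, LayerNorm removes the shift and RMSNorm does not. The helper `max_abs_diff` and the sample vectors are illustrative assumptions.

```python
import math

def layer_norm(x, eps=1e-6):
    # Unit gain and zero bias, to isolate the normalization step.
    mu = sum(x) / len(x)
    var = sum((xi - mu) ** 2 for xi in x) / len(x)
    return [(xi - mu) / math.sqrt(var + eps) for xi in x]

def rms_norm(x, eps=1e-6):
    # Unit gain; no centering.
    mean_sq = sum(xi * xi for xi in x) / len(x)
    return [xi / math.sqrt(mean_sq + eps) for xi in x]

def max_abs_diff(a, b):
    return max(abs(ai - bi) for ai, bi in zip(a, b))

centered = [-3.0, -1.0, 1.0, 3.0]            # mean is exactly 0
shifted = [xi + 10.0 for xi in centered]     # same spread, mean 10

gap_centered = max_abs_diff(layer_norm(centered), rms_norm(centered))
gap_shifted = max_abs_diff(layer_norm(shifted), rms_norm(shifted))
shift_effect_ln = max_abs_diff(layer_norm(centered), layer_norm(shifted))
```

Here `gap_centered` and `shift_effect_ln` are numerically zero, while `gap_shifted` is large. The quality claim above amounts to saying that, in trained models, this missing shift invariance matters little in practice.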

Pre-Norm vs Post-Norm

Placement	Era	Update rule	Behavior at depth
Post-norm	Original (2017)	Normalize after the residual add: $x_{l+1} = \mathrm{LayerNorm}(x_{l} + \mathrm{Sublayer}(x_{l}))$	Unstable at depth; gradients struggle to pass through the norm in every layer
Pre-norm	Modern (2026)	Normalize before the sublayer: $x_{l+1} = x_{l} + \mathrm{Sublayer}(\mathrm{RMSNorm}(x_{l}))$	Stable at depth; the residual stream is preserved

The switch to pre-norm was one of the earliest and most impactful changes to the original architecture, enabling training of 100+ layer models [10].
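
The two placements can be sketched side by side in plain Python. This is a toy illustration under stated assumptions: gains and biases are omitted, and `zero_sublayer` is a hypothetical stand-in for an attention or feed-forward sublayer that contributes nothing, chosen so the effect of the placement itself is visible.

```python
import math

def layer_norm(x, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((xi - mu) ** 2 for xi in x) / len(x)
    return [(xi - mu) / math.sqrt(var + eps) for xi in x]

def rms_norm(x, eps=1e-6):
    mean_sq = sum(xi * xi for xi in x) / len(x)
    return [xi / math.sqrt(mean_sq + eps) for xi in x]

def post_norm_block(x, sublayer):
    # x_{l+1} = LayerNorm(x_l + Sublayer(x_l)): the norm sits on the residual path.
    s = sublayer(x)
    return layer_norm([xi + si for xi, si in zip(x, s)])

def pre_norm_block(x, sublayer):
    # x_{l+1} = x_l + Sublayer(RMSNorm(x_l)): the residual path is untouched.
    s = sublayer(rms_norm(x))
    return [xi + si for xi, si in zip(x, s)]

def zero_sublayer(h):
    # Hypothetical sublayer that outputs zeros.
    return [0.0 for _ in h]

x = [2.0, 4.0, 6.0, 8.0]
pre_out = pre_norm_block(x, zero_sublayer)    # identical to x
post_out = post_norm_block(x, zero_sublayer)  # x re-centered and rescaled
```

With a sublayer that adds nothing, the pre-norm block is an exact identity, so the input passes through unchanged. The post-norm block still transforms its input, and in a deep stack that transformation is applied once per layer on the residual path.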

QK-Normalization

QK-normalization is an additional normalization applied to the query and key projections before the dot product in attention. It prevents attention logits from growing excessively large at scale. According to [3], it is used in Llama 3, PaLM, and many 2024+ models.
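
A minimal sketch of the idea, for a single query against a few keys, is shown below. It assumes RMSNorm with unit gain as the normalizer and the usual $1/\sqrt{d}$ scaling; real models typically add a learned gain and may use a different normalizer or scale. The function name `attention_logits` and all numeric values are hypothetical.

```python
import math

def rms_norm(x, eps=1e-6):
    mean_sq = sum(xi * xi for xi in x) / len(x)
    return [xi / math.sqrt(mean_sq + eps) for xi in x]

def attention_logits(q, keys, qk_norm):
    """Scaled dot-product logits for one query, optionally with QK-normalization."""
    d = len(q)
    if qk_norm:
        q = rms_norm(q)                      # normalize the query
        keys = [rms_norm(k) for k in keys]   # normalize each key
    scale = 1.0 / math.sqrt(d)
    return [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]

# Hypothetical large-magnitude projections, as might appear late in training.
q = [100.0, -200.0, 300.0, 50.0]
keys = [[150.0, -100.0, 250.0, 10.0],
        [-80.0, 90.0, -300.0, 20.0]]

raw = attention_logits(q, keys, qk_norm=False)
normed = attention_logits(q, keys, qk_norm=True)
```

After normalization each vector has norm at most $\sqrt{d}$, so every logit in `normed` is bounded by $\sqrt{d}$ in absolute value regardless of how large the projections grow, whereas the entries of `raw` scale with the product of the magnitudes.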

References

[3] Tan, Crystallization of Transformer Architectures
[10] Limitations of Normalization in Attention (OpenReview, 2025)

Talos Research Wiki
