Layer Normalization & RMSNorm

Normalization layers are critical for stable training of deep transformer networks. The 2026 consensus uses pre-norm RMSNorm, a departure from the original post-norm LayerNorm.

Layer Normalization (Original)

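Given an input vector $x \in \mathbb{R}^d$, LayerNorm normalizes to mean 0 and variance 1, then applies learned affine parameters:

$$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Where $\mu = \frac{1}{d}\sum_{i=1}^{d} x_i$ and $\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2$. Here $\gamma$ and $\beta$ are the learned gain and bias, and $\epsilon$ is a small constant for numerical stability.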
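
The definition above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function name `layer_norm`, the default `eps=1e-5`, and the toy input are assumptions chosen for the example.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize x to zero mean and unit variance, then apply gain and bias."""
    d = len(x)
    mean = sum(x) / d
    # Biased (population) variance over the feature dimension.
    var = sum((v - mean) ** 2 for v in x) / d
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

# Toy input (assumed values); identity gain and zero bias.
x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
```

With identity gain and zero bias, the output has mean close to 0 and variance close to 1 (slightly below 1 because of `eps`).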

RMSNorm (2026 Standard)

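RMSNorm simplifies this by removing the mean-centering step; it only rescales by the root-mean-square, with a learned gain $\gamma$ and no bias term:

$$\mathrm{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}$$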
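
A minimal sketch of RMSNorm in plain Python follows. The function name `rms_norm`, the default `eps=1e-5`, and the toy input are assumptions for illustration. Note that only one reduction (the mean of squares) is needed and no mean is subtracted.

```python
import math

def rms_norm(x, gamma, eps=1e-5):
    """Rescale x by its root-mean-square, then apply a learned gain."""
    d = len(x)
    mean_sq = sum(v * v for v in x) / d  # single reduction, no centering
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for v, g in zip(x, gamma)]

# Toy input (assumed values); identity gain.
x = [1.0, 2.0, 3.0, 4.0]
y = rms_norm(x, gamma=[1.0] * 4)
```

The output has a root-mean-square close to 1, but unlike LayerNorm its mean is generally not zero, because the input is never centered.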

Why RMSNorm won [3]:

  • ~40% faster than LayerNorm (no mean to compute or subtract; a single sum of squares is enough)
  • Marginal quality difference at scale
  • Simpler gradient flow

Pre-Norm vs Post-Norm

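Placement | Formula | Behavior
Post-norm (original, 2017) | Normalize after residual: $y = \mathrm{Norm}(x + \mathrm{Sublayer}(x))$ | Unstable at depth; gradients struggle through norm
Pre-norm (modern, 2026) | Normalize before sublayer: $y = x + \mathrm{Sublayer}(\mathrm{Norm}(x))$ | Stable at any depth; residual stream preserved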

The switch to pre-norm was one of the earliest and most impactful changes to the original architecture, enabling training of 100+ layer models [10].
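
The difference between the two placements can be sketched as two small functions. Everything here is illustrative: `toy_sublayer` is a hypothetical stand-in for an attention or MLP sublayer, and a gain-free RMSNorm is assumed as the normalization.

```python
import math

def rms_norm(x, eps=1e-5):
    """Gain-free RMSNorm, used here only as the example normalization."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms for v in x]

def post_norm_block(x, sublayer, norm=rms_norm):
    """Post-norm: y = Norm(x + Sublayer(x)). The residual sum is normalized."""
    s = sublayer(x)
    return norm([a + b for a, b in zip(x, s)])

def pre_norm_block(x, sublayer, norm=rms_norm):
    """Pre-norm: y = x + Sublayer(Norm(x)). The residual path is untouched."""
    s = sublayer(norm(x))
    return [a + b for a, b in zip(x, s)]

def toy_sublayer(h):
    # Hypothetical stand-in for attention or an MLP.
    return [0.1 * v for v in h]

x = [10.0, -20.0, 30.0, -40.0]
post = post_norm_block(x, toy_sublayer)
pre = pre_norm_block(x, toy_sublayer)
```

In the pre-norm block, `x` reaches the output through the addition unchanged, so a sublayer that outputs zeros makes the block an identity map. In the post-norm block, the residual stream itself is rescaled at every layer.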

QK-Normalization

QK-normalization is an additional normalization applied to the Query and Key projections before the dot product in attention. It prevents attention logits from growing excessively large at scale. It is used in many 2024+ models [3], for example OLMo 2, Gemma 3, and Qwen3.
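
The effect on logit magnitude can be sketched for a single query/key pair. This example assumes a gain-free RMS normalization of each vector; real implementations typically add a learned gain, and some use LayerNorm or L2 normalization instead. The names `rms_normalize` and `attention_logit` and the toy vectors are assumptions for the example.

```python
import math

def rms_normalize(v, eps=1e-6):
    """Rescale v to (approximately) unit root-mean-square."""
    inv_rms = 1.0 / math.sqrt(sum(c * c for c in v) / len(v) + eps)
    return [c * inv_rms for c in v]

def attention_logit(q, k, qk_norm=False):
    """Scaled dot-product logit for one query/key pair."""
    if qk_norm:
        q, k = rms_normalize(q), rms_normalize(k)
    d = len(q)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(d)

# Toy projections with large magnitudes (assumed values).
q = [50.0, -80.0, 120.0, 30.0]
k = [40.0, -90.0, 100.0, 60.0]
raw = attention_logit(q, k)
normed = attention_logit(q, k, qk_norm=True)
```

After normalization each vector has squared norm of at most d, so by the Cauchy-Schwarz inequality the scaled logit is bounded by sqrt(d) in absolute value, regardless of how large the raw projections are.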

References

  • Tan — Crystallization of Transformer Architectures
  • [10] Limitations of Normalization in Attention (OpenReview, 2025)