The Crystallization of Transformer Architectures (2017-2025)

Author: Jun Yu Tan
Published: December 5, 2025

Summary

An analysis of 53 transformer LLMs released over eight years (2017-2025), documenting how their architectures converged. It identifies a de facto 2023-2025 stack: pre-norm with RMSNorm, RoPE, SwiGLU MLPs, KV sharing (MQA/GQA), and bias-free layers. The analysis is dataset-driven, with model specifications cross-referenced against primary sources.

Key Convergence Points (2024-2025 Consensus)

  • Normalization: Pre-norm RMSNorm (replaced post-norm LayerNorm)
  • Position encoding: RoPE (replaced sinusoidal)
  • MLP activation: SwiGLU (replaced ReLU)
  • MLP expansion ratio: 8/3x with SwiGLU (vs 4x with ReLU)
  • Attention: GQA or MQA (replaced full MHA)
  • Bias terms: Mostly removed
  • Architecture: Decoder-only (replaced encoder-decoder)
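
The expansion-ratio and attention entries above come down to simple arithmetic, sketched below in Python. A gated SwiGLU MLP has three weight matrices where a ReLU MLP has two, so shrinking the hidden width from 4x to 8/3x keeps the parameter count the same, which is the usual rationale for the 8/3x figure. The KV cache, in turn, scales with the number of key/value heads, which is what GQA and MQA reduce. The function names (mlp_params, kv_cache_bytes) and the model shape (d_model = 3072, 32 layers, 24 query heads of dimension 128, fp16 cache, 4096-token context) are assumptions chosen for illustration and do not correspond to any model in the dataset.

```python
from fractions import Fraction


def mlp_params(d_model: int, d_ff: int, gated: bool) -> int:
    """Weight count of one bias-free MLP block."""
    # Non-gated (ReLU): up and down projections, 2 matrices.
    # Gated (SwiGLU): gate, up, and down projections, 3 matrices.
    n_matrices = 3 if gated else 2
    return n_matrices * d_model * d_ff


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """KV-cache size for one sequence: keys plus values in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical shape, picked so that 8/3 * d_model is an integer.
d_model = 3072
n_layers, n_heads, head_dim, seq_len = 32, 24, 128, 4096

relu_ff = 4 * d_model                         # 12288
swiglu_ff = int(Fraction(8, 3) * d_model)     # 8192

relu_count = mlp_params(d_model, relu_ff, gated=False)
swiglu_count = mlp_params(d_model, swiglu_ff, gated=True)
print(relu_count, swiglu_count)               # 75497472 75497472

# MHA keeps one KV head per query head; GQA shares each KV head across a
# group of query heads (8 KV heads assumed here); MQA keeps a single KV head.
cache_mib = {
    name: kv_cache_bytes(n_layers, kv_heads, head_dim, seq_len) // 2**20
    for name, kv_heads in [("MHA", n_heads), ("GQA", 8), ("MQA", 1)]
}
print(cache_mib)                              # {'MHA': 1536, 'GQA': 512, 'MQA': 64}
```

In practice the 8/3x width is often rounded to a hardware-friendly multiple, so published hidden sizes may not match the ratio exactly.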

Eras Identified

  1. Foundations (2017-2019): Original transformer, BERT, GPT
  2. Exploration (2020-2022): Wide experimentation with alternatives
  3. Convergence (2023-2024): Industry settles on the consensus stack
  4. Refinement (2024-2025): MoE, long-context attention, post-training

Remaining Frontiers

  • MoE routing strategies
  • Long-context attention mechanisms
  • Post-training alignment methods