The Crystallization of Transformer Architectures (2017-2025)
Author: Jun Yu Tan
Published: December 5, 2025
Summary
Analysis of 53 transformer LLMs across eight years documenting architectural convergence. Identifies a de facto 2023-2025 stack: pre-norm (RMSNorm), RoPE, SwiGLU MLPs, KV-sharing (MQA/GQA), and bias-free layers. Dataset-driven analysis with model specifications cross-referenced against primary sources.
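As an illustration of two components of this stack, the following minimal Python sketch applies RMSNorm and RoPE to a single vector. The function names, the eps value, and the base of 10000 are assumptions chosen for illustration, not details taken from the analysis. The rotation uses the interleaved-pair convention; production implementations work on batched tensors and may lay out the pairs differently.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root mean square, then apply a per-element gain.

    Unlike LayerNorm, no mean is subtracted and no bias is added.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def rope(x, position, base=10000.0):
    """Rotate each pair (x[i], x[i+1]) by an angle proportional to position."""
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even dimension"
    out = []
    for i in range(0, d, 2):
        # i is the element index, so base ** (-i / d) is the pair frequency.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))
```

Because every pair is rotated, the dot product of a rotated query and a rotated key depends only on the difference between their positions, which is what lets RoPE act as a relative position encoding.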
Key Convergence Points (2024-2025 Consensus)
- Normalization: Pre-norm RMSNorm (replaced post-norm LayerNorm)
- Position encoding: RoPE (replaced sinusoidal)
- MLP activation: SwiGLU (replaced ReLU)
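- MLP expansion ratio: 8/3x with SwiGLU (vs 4x with ReLU), which keeps the parameter count roughly equal because a SwiGLU MLP has three weight matrices instead of two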
- Attention: GQA or MQA (replaced full MHA)
- Bias terms: Mostly removed
- Architecture: Decoder-only (replaced encoder-decoder)
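The arithmetic behind two of the points above (the 8/3x expansion ratio and KV-sharing) can be checked with a short Python sketch. All dimensions here (d_model of 4096, 32 layers, 32 query heads, head size 128, 4096 tokens, 2 bytes per value) are illustrative assumptions, not figures from the dataset, and the helper names are hypothetical.

```python
def mlp_params(d_model, d_ff, gated):
    """Weight count of one bias-free MLP block.

    A classic MLP has two matrices (up, down); a gated MLP such as
    SwiGLU has three (gate, up, down).
    """
    n_matrices = 3 if gated else 2
    return n_matrices * d_model * d_ff


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache size for one sequence: keys plus values in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Illustrative dimensions (assumptions, not taken from the analysis).
d_model = 4096
relu_params = mlp_params(d_model, 4 * d_model, gated=False)
swiglu_params = mlp_params(d_model, 8 * d_model // 3, gated=True)
print(f"SwiGLU / ReLU MLP parameter ratio: {swiglu_params / relu_params:.4f}")

# Same model shape, varying only the number of key/value heads.
shape = dict(n_layers=32, head_dim=128, seq_len=4096)
mha = kv_cache_bytes(n_kv_heads=32, **shape)  # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **shape)   # groups of 4 query heads share a KV head
mqa = kv_cache_bytes(n_kv_heads=1, **shape)   # all query heads share one KV head
print(f"KV cache MiB: MHA={mha / 2**20:.0f} GQA={gqa / 2**20:.0f} MQA={mqa / 2**20:.0f}")
```

Under these assumptions the two MLP variants have nearly the same weight count, and the KV cache shrinks in direct proportion to the number of key/value heads.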
Eras Identified
- Foundations (2017-2019): Original transformer, BERT, GPT
- Exploration (2020-2022): Wide experimentation with alternatives
- Convergence (2023-2024): Industry settles on the consensus stack
- Refinement (2024-2025): MoE, long-context attention, post-training
Remaining Frontiers
- MoE routing strategies
- Long-context attention mechanisms
- Post-training alignment methods
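To make the first frontier concrete, here is a minimal Python sketch of one widely used routing scheme, top-k softmax gating. It is offered as an assumption for illustration only: the summary does not say which routing strategies the analysis covers, and the function name and the choice of k=2 are hypothetical.

```python
import math


def top_k_route(router_logits, k=2):
    """Select the k highest-scoring experts and softmax-normalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# One token's router scores over four experts (made-up values).
routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
```

Each token's output would then be the weighted sum of the selected experts' MLP outputs; load balancing across experts, which real systems also need, is left out of this sketch.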