The Crystallization of Transformer Architectures (2017-2025)

Author: Jun Yu Tan
Published: December 5, 2025

Summary

An analysis of 53 transformer LLMs released over eight years (2017-2025), documenting how their architectures converged. It identifies a de facto 2023-2025 stack: pre-norm with RMSNorm, RoPE position encoding, SwiGLU MLPs, KV-sharing attention (MQA/GQA), and bias-free layers. The analysis is dataset-driven, with model specifications cross-referenced against primary sources.

Key Convergence Points (2024-2025 Consensus)

Normalization: Pre-norm RMSNorm (replaced post-norm LayerNorm)
Position encoding: RoPE (replaced sinusoidal)
MLP activation: SwiGLU (replaced ReLU)
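MLP expansion ratio: 8/3x with SwiGLU (vs 4x with ReLU); the gated MLP has three weight matrices instead of two, so the smaller ratio keeps the parameter count roughly equal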
Attention: GQA or MQA (replaced full MHA)
Bias terms: Mostly removed
Architecture: Decoder-only (replaced encoder-decoder)
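
The expansion-ratio and KV-sharing entries above come down to simple arithmetic, sketched here in Python. The model shape (d_model = 4096, 32 layers, 32 query heads, 8 KV heads for GQA, 2-byte values) is a hypothetical example chosen for illustration, not a configuration taken from the dataset, and the helper names are invented for this sketch.

```python
def mlp_params(d_model: int, d_ff: int, gated: bool) -> int:
    """Weight count of one bias-free MLP block."""
    # Plain MLP: up (d_model x d_ff) and down (d_ff x d_model) projections.
    # Gated MLP (e.g. SwiGLU): one extra gate projection shaped like "up".
    n_matrices = 3 if gated else 2
    return n_matrices * d_model * d_ff


def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of key/value cache one token occupies across all layers."""
    # The leading 2 accounts for storing both a key and a value vector.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape (an assumption, not from the source dataset).
d_model, n_layers, n_heads = 4096, 32, 32
head_dim = d_model // n_heads  # 128

# MLP: 4x expansion with two matrices vs 8/3x expansion with three.
relu_mlp = mlp_params(d_model, 4 * d_model, gated=False)
swiglu_mlp = mlp_params(d_model, 8 * d_model // 3, gated=True)

# Attention: only the number of KV heads differs between the variants.
mha_cache = kv_cache_bytes_per_token(n_layers, n_heads, head_dim)  # 32 KV heads
gqa_cache = kv_cache_bytes_per_token(n_layers, 8, head_dim)        # 8 KV heads
mqa_cache = kv_cache_bytes_per_token(n_layers, 1, head_dim)        # 1 KV head

print(f"ReLU MLP weights:   {relu_mlp:,}")
print(f"SwiGLU MLP weights: {swiglu_mlp:,}")
print(f"KV cache per token: MHA {mha_cache} B, GQA {gqa_cache} B, MQA {mqa_cache} B")
```

Under these assumptions the two MLP variants hold nearly the same number of weights, and the per-token KV cache shrinks in proportion to the number of KV heads: 4x smaller for GQA with 8 of 32 heads, and 32x smaller for MQA.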

Eras Identified

Foundations (2017-2019): Original transformer, BERT, GPT
Exploration (2020-2022): Wide experimentation with alternatives
Convergence (2023-2024): Industry settles on the consensus stack
Refinement (2024-2025): MoE, long-context attention, post-training

Remaining Frontiers

MoE routing strategies
Long-context attention mechanisms
Post-training alignment methods

Talos Research Wiki
