Transformer Architecture
The Transformer is a neural network architecture introduced by Vaswani et al. in the 2017 paper “Attention Is All You Need” that replaces recurrence (RNNs/LSTMs) entirely with attention mechanisms. As of 2026, it is the foundational architecture behind the major LLM families, including GPT, Claude, Gemini, Llama, Mistral, and DeepSeek.
Core Idea
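Instead of processing tokens sequentially (one at a time, like RNNs), a transformer allows every token in a sequence to directly attend to every other token simultaneously via self-attention. Because no position has to wait for the previous position's hidden state, all positions can be computed in parallel during training, which made transformers much faster to train. Any two tokens are also connected by a single attention step rather than a long chain of recurrent updates, which helps in capturing long-range dependencies.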
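A minimal sketch of the self-attention idea described above, written in plain Python to stay dependency-free. The token vectors are made up for illustration, and the learned query, key, and value projections of a real model are omitted as a simplifying assumption, so the same vectors play all three roles.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: each query scores every key."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity between this token's query and every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 3-token sequence with 2-dimensional vectors (assumption:
# no learned projections, so tokens act as queries, keys, and values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)

for i, row in enumerate(weights):
    print(f"token {i} attends with weights {[round(w, 3) for w in row]}")
```

The loop over queries is only for readability: no iteration depends on another, which is why real implementations can compute all rows at once as a single matrix product.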
The 2026 Consensus Stack
Eight years of iteration converged on a remarkably consistent set of design choices across 53 major models studied in crystallization-transformer-architectures-2025:
| Component | 2017 Original | 2026 Consensus | Why Changed |
|---|---|---|---|
| Normalization | Post-norm LayerNorm | Pre-norm RMSNorm | Pre-norm stabilizes deep training; RMSNorm is ~40% faster |
| Position Encoding | Sinusoidal | [[positional-encoding\|RoPE]] | |
| MLP Activation | ReLU | SwiGLU | 25-40% better quality-per-FLOP |
| MLP Expansion | 4× | 8/3× (with SwiGLU) | Offsets the extra gate matrix so the parameter count stays roughly the same |
| Attention Heads | Full MHA | [[kv-cache\|GQA or MQA]] | |
| Bias Terms | Present | Mostly absent | Simplifies, improves transfer |
| Architecture | Encoder-Decoder | Decoder-only | Sufficient for generative tasks |
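The 8/3× expansion in the table can be checked with simple arithmetic. The sketch below assumes bias-free feed-forward layers, two weight matrices for the original design and three (gate, up, down) for SwiGLU; the width `d_model = 4096` is a hypothetical value chosen only for illustration.

```python
def ffn_params_standard(d_model, expansion=4):
    """Two bias-free matrices: up (d_model x hidden) and down (hidden x d_model)."""
    hidden = expansion * d_model
    return 2 * d_model * hidden

def ffn_params_swiglu(d_model, expansion=8 / 3):
    """Three bias-free matrices: gate, up, and down."""
    hidden = round(expansion * d_model)
    return 3 * d_model * hidden

d_model = 4096  # hypothetical model width, not taken from any specific model
standard = ffn_params_standard(d_model)
gated = ffn_params_swiglu(d_model)

print(f"standard 4x FFN  : {standard:,} parameters")
print(f"SwiGLU 8/3x FFN  : {gated:,} parameters")
print(f"ratio            : {gated / standard:.5f}")
```

In symbols, 2 · d · 4d = 8d² and 3 · d · (8/3)d = 8d², so the two designs match up to rounding of the hidden size. Real models often round the hidden size to a hardware-friendly multiple, so their exact counts can differ slightly from this sketch.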
Block Structure
A modern transformer decoder block (2026 style) performs:
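- RMSNorm on the block input
- Multi-head self-attention (with GQA and RoPE) on the normalized input
- Residual connection: add the block input back to the attention output
- RMSNorm on that sum
- SwiGLU feed-forward network (8/3× expansion)
- Residual connection: add the sum from the first residual connection back to the feed-forward output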
This block repeats 32-128 times depending on model scale.
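The wiring of the six steps listed above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a full implementation: `toy_attention` and `toy_ffn` are hypothetical stand-ins for the real sublayers, and the block operates on a single token's hidden vector, whereas real attention sees the whole sequence.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square; no mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    if gain is None:
        gain = [1.0] * len(x)  # learned per-feature scale in a real model
    return [g * v / rms for g, v in zip(gain, x)]

def add(a, b):
    """Element-wise sum, used for the residual connections."""
    return [ai + bi for ai, bi in zip(a, b)]

def decoder_block(x, attention, feed_forward):
    """Pre-norm block: normalize, apply the sublayer, add the sublayer input back."""
    h = add(x, attention(rms_norm(x)))        # norm -> attention -> residual
    return add(h, feed_forward(rms_norm(h)))  # norm -> feed-forward -> residual

# Hypothetical stand-in sublayers (assumptions for illustration only).
def toy_attention(v):
    return [0.5 * t for t in v]

def toy_ffn(v):
    return [0.1 * t for t in v]

x = [3.0, 4.0]
y = decoder_block(x, toy_attention, toy_ffn)
print(y)
```

Because each sublayer only adds to the running sum, the unnormalized input always has a direct path to the block output, which is the property usually credited for making deep pre-norm stacks stable to train.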
Encoder vs Decoder vs Encoder-Decoder
- Encoder-only (BERT): Bidirectional attention, suited to classification, embeddings, and understanding tasks
- Decoder-only (GPT, Llama, Claude): Causal (masked) attention, suited to generation
- Encoder-Decoder (T5, the original Transformer): Bidirectional encoder plus causal decoder, used for translation and summarization
In 2026, decoder-only dominates because causal language modeling scales better and simpler architectures train more efficiently.
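The difference between bidirectional and causal attention comes down to which positions each token is allowed to see. The sketch below builds both masks as boolean grids; the function names are hypothetical and chosen only for this illustration.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def render(mask):
    """Draw a mask as text: 'x' where attention is allowed, '.' where blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)

n = 4  # hypothetical sequence length
print("bidirectional:")
print(render(bidirectional_mask(n)))
print("causal:")
print(render(causal_mask(n)))
```

In practice the blocked entries are applied by setting the corresponding attention scores to negative infinity before the softmax, so they receive zero weight.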
Key Papers
- attention-is-all-you-need — The original 2017 paper
- crystallization-transformer-architectures-2025 — 53-model convergence analysis
Related Concepts
Limitations
See attention-mechanism > Limitations for known failure modes, including O(n²) cost in sequence length n, softmax saturation, and topological blind spots.