Transformer Architecture

The Transformer is a neural network architecture introduced by Vaswani et al. in the 2017 paper “Attention Is All You Need”. It replaces recurrence (RNNs/LSTMs) entirely with attention mechanisms. As of 2026, it is the foundational architecture behind the major LLM families, including GPT, Claude, Gemini, Llama, Mistral, and DeepSeek.

Core Idea

Instead of processing tokens one at a time as an RNN does, a transformer lets every token in a sequence attend directly to every other token via self-attention. Because all positions are computed in parallel during training, transformers train far faster on modern accelerators, and the direct token-to-token paths make long-range dependencies easier to capture.
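To make the idea concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It assumes the queries, keys, and values are passed in directly; a real model would first compute them from the token embeddings with learned projection matrices, which are omitted here. The function names and the toy input are illustrative only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(q, k, v):
    """Scaled dot-product attention.

    q, k, v: lists of n vectors (lists of floats). Every query is scored
    against every key, so each output row mixes information from all
    positions in a single step.
    """
    d = len(q[0])
    out = []
    for qi in q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

# Toy sequence of three 2-dimensional token vectors (assumed values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = self_attention(x, x, x)
for row in y:
    print([round(c, 3) for c in row])
```

The loop over queries is written sequentially for readability, but each iteration is independent of the others, which is what lets real implementations compute all positions as one batched matrix product.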

The 2026 Consensus Stack

Eight years of iteration converged on a remarkably consistent set of design choices across 53 major models studied in crystallization-transformer-architectures-2025:

| Component | 2017 Original | 2026 Consensus | Why Changed |
| --- | --- | --- | --- |
| Normalization | Post-norm LayerNorm | Pre-norm RMSNorm | Pre-norm stabilizes deep training; RMSNorm is ~40% faster |
| Position Encoding | Sinusoidal | [[positional-encoding\|RoPE]] | |
| MLP Activation | ReLU | SwiGLU | 25-40% better quality-per-FLOP |
| MLP Expansion | 4× | 8/3× (with SwiGLU) | Compensates for gate parameters |
| Attention Heads | Full MHA | [[kv-cache\|GQA or MQA]] | |
| Bias Terms | Present | Mostly absent | Simplifies, improves transfer |
| Architecture | Encoder-Decoder | Decoder-only | Sufficient for generative tasks |
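The "compensates for gate parameters" entry can be checked with simple arithmetic. The sketch below assumes the 4× expansion of the original Transformer (d_model = 512, d_ff = 2048), no bias terms, and a hypothetical model width of 3072 chosen so that 8/3× yields a whole number. A ReLU MLP has two weight matrices, while a SwiGLU MLP has three (gate, up, down), so shrinking the hidden width to 8/3× keeps the parameter count the same.

```python
def relu_mlp_params(d_model, expansion=4):
    """Weights in a two-matrix MLP: up-projection and down-projection."""
    d_ff = round(expansion * d_model)
    return 2 * d_model * d_ff

def swiglu_mlp_params(d_model, expansion=8 / 3):
    """Weights in a three-matrix SwiGLU MLP: gate, up, and down projections."""
    d_ff = round(expansion * d_model)
    return 3 * d_model * d_ff

d_model = 3072  # hypothetical width, divisible by 3 so that 8/3 * d_model is an integer
print("ReLU, 4x expansion:     ", relu_mlp_params(d_model))
print("SwiGLU, 8/3x expansion: ", swiglu_mlp_params(d_model))
```

Both counts come out to 8 × d_model² here. In practice, implementations often round the SwiGLU hidden width to a hardware-friendly multiple, so the match is approximate rather than exact.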

Block Structure

A modern transformer decoder block (2026 style) performs:

  1. RMSNorm on the block input
  2. Multi-head self-attention (with GQA and RoPE)
  3. Residual connection: add the block input to the attention output
  4. RMSNorm on the result of step 3
  5. SwiGLU feed-forward network (8/3× expansion)
  6. Residual connection: add the result of step 3 to the feed-forward output

This block repeats 32-128 times depending on model scale.
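The six steps can be sketched in plain Python as follows. This is a minimal illustration of the pre-norm wiring, not a reference implementation: the attention sub-layer is passed in as a function, and the usage example substitutes a hypothetical `causal_mean` stand-in (a running average over earlier positions) rather than real multi-head attention with GQA and RoPE. All sizes and weight values are assumed toy values.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root-mean-square, then apply a learned per-dimension gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x)), without bias terms."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def decoder_block(xs, attn, ffn, gain1, gain2):
    """xs: list of token vectors. attn maps a sequence to a sequence;
    ffn maps a single vector to a single vector."""
    normed = [rms_norm(x, gain1) for x in xs]      # step 1
    attended = attn(normed)                        # step 2
    h = [add(x, a) for x, a in zip(xs, attended)]  # step 3: residual
    normed2 = [rms_norm(v, gain2) for v in h]      # step 4
    fed = [ffn(v) for v in normed2]                # step 5
    return [add(v, f) for v, f in zip(h, fed)]     # step 6: residual

def causal_mean(seq):
    """Hypothetical stand-in for attention: average of positions 0..i."""
    out = []
    for i in range(len(seq)):
        prefix = seq[: i + 1]
        out.append([sum(col) / len(prefix) for col in zip(*prefix)])
    return out

d, d_ff = 4, 6  # toy sizes, not the 8/3 ratio
ones = [1.0] * d
w_gate = [[0.1] * d for _ in range(d_ff)]
w_up = [[0.1] * d for _ in range(d_ff)]
w_down = [[0.1] * d_ff for _ in range(d)]

xs = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
ys = decoder_block(xs, causal_mean,
                   lambda v: swiglu_ffn(v, w_gate, w_up, w_down), ones, ones)
```

Note that both residual additions bypass the normalization: the un-normalized stream `xs` (and then `h`) is carried forward, and only the inputs to the sub-layers are normalized. This is the pre-norm arrangement; a post-norm block would normalize after each addition instead.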

Encoder vs Decoder vs Encoder-Decoder

  • Encoder-only (BERT): Bidirectional attention; suited to classification, embeddings, and understanding tasks
  • Decoder-only (GPT, Llama, Claude): Causal (masked) attention; suited to generation
  • Encoder-Decoder (T5, the original Transformer): Bidirectional encoder plus causal decoder; used for translation and summarization
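The difference between bidirectional and causal attention comes down to which positions each token is allowed to see. A small sketch, using a hypothetical helper `attention_mask`, builds both masks for a four-token sequence:

```python
def attention_mask(n, causal):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]

def show(mask):
    """Render a mask as rows of 1s and 0s."""
    return "\n".join(" ".join("1" if m else "0" for m in row) for row in mask)

print("Bidirectional (encoder):")
print(show(attention_mask(4, causal=False)))
print("Causal (decoder):")
print(show(attention_mask(4, causal=True)))
# The causal mask is lower-triangular:
# 1 0 0 0
# 1 1 0 0
# 1 1 1 0
# 1 1 1 1
```

In an attention implementation, the scores at masked-out positions are typically set to negative infinity before the softmax so they receive zero weight. An encoder-decoder model uses the all-ones pattern in its encoder and the triangular pattern in its decoder.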

In 2026, decoder-only dominates because causal language modeling scales better and simpler architectures train more efficiently.

Limitations

See attention-mechanism > Limitations for known failure modes, including O(n²) cost in sequence length, softmax saturation, and topological blind spots.