Transformer Architecture

The Transformer is a neural network architecture introduced by Vaswani et al. in the 2017 paper “Attention Is All You Need”. It replaces recurrence (RNNs/LSTMs) entirely with attention mechanisms. As of 2026, it is the foundational architecture behind the major LLM families, including GPT, Claude, Gemini, Llama, Mistral, and DeepSeek.

Core Idea

Instead of processing tokens sequentially (one at a time, like RNNs), a transformer lets every token in a sequence attend directly to every other token via self-attention, with all positions computed in parallel. This parallelism makes transformers much faster to train on modern hardware, and the direct token-to-token connections make long-range dependencies easier to capture.
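
A minimal NumPy sketch of single-head self-attention may help make this concrete. The function names, tensor sizes, and random weights are illustrative assumptions, not taken from any particular model. The point is that one matrix product scores every token against every other token at once.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a whole sequence at once.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # scores[i, j] is how strongly token i attends to token j.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

# Illustrative sizes and random weights (assumptions for the sketch).
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

No loop over positions appears anywhere: the `(seq_len, seq_len)` score matrix is computed in one step, which is what allows training to parallelize across the sequence. That same matrix is also the source of the O(n²) cost noted under Limitations.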

The 2026 Consensus Stack

Eight years of iteration converged on a remarkably consistent set of design choices across 53 major models studied in crystallization-transformer-architectures-2025:

Component	2017 Original	2026 Consensus	Why Changed
Normalization	Post-norm LayerNorm	Pre-norm RMSNorm	Pre-norm stabilizes deep training; RMSNorm is ~40% faster
Position Encoding	Sinusoidal	[[positional-encoding|RoPE]]
MLP Activation	ReLU	SwiGLU	25-40% better quality-per-FLOP
MLP Expansion	4×	8/3× (with SwiGLU)	Compensates for gate parameters
Attention Heads	Full MHA	[[kv-cache|GQA or MQA]]
Bias Terms	Present	Mostly absent	Simplifies, improves transfer
Architecture	Encoder-Decoder	Decoder-only	Sufficient for generative tasks
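
The 8/3× expansion in the table follows from parameter counting, sketched below. A ReLU MLP has two weight matrices, while SwiGLU adds a third (the gate), so the hidden width shrinks by a factor of 2/3 to keep the total roughly constant. The value `d_model = 4096` and the helper names are assumptions chosen for illustration, and bias terms are ignored.

```python
def mlp_params_relu(d_model, expansion=4):
    # Two weight matrices: up-projection and down-projection (no biases).
    d_ff = expansion * d_model
    return 2 * d_model * d_ff

def mlp_params_swiglu(d_model, d_ff):
    # Three weight matrices: gate, up-projection, and down-projection (no biases).
    return 3 * d_model * d_ff

d_model = 4096                      # illustrative width, not from the source
d_ff_swiglu = 8 * d_model // 3      # 8/3x expansion, rounded down

relu = mlp_params_relu(d_model)
swiglu_matched = mlp_params_swiglu(d_model, d_ff_swiglu)
swiglu_naive = mlp_params_swiglu(d_model, 4 * d_model)

print(swiglu_matched / relu)  # close to 1: parameter budget preserved
print(swiglu_naive / relu)    # 1.5: keeping 4x would cost 50% more parameters
```

In practice the hidden width is usually rounded to a hardware-friendly multiple rather than taken as exactly 8/3 × `d_model`.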

Block Structure

A modern transformer decoder block (2026 style) performs:

RMSNorm on input
Multi-head self-attention (with GQA and RoPE)
Residual connection: add the block input back to the attention output
RMSNorm on result
SwiGLU feed-forward network (8/3× expansion)
Residual connection: add the post-attention sum (the input to the second RMSNorm) back to the feed-forward output

This block typically repeats 32-128 times in large models, with the exact depth depending on model scale.
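
The six steps above can be sketched as a single NumPy function. This is a simplified illustration under stated assumptions, not any specific model's implementation: the parameter names, the toy sizes, the half-split RoPE pairing, and the random initialization are all hypothetical choices, and batching and KV caching are omitted.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Scale by the root-mean-square of the features; no mean subtraction, no bias.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * gain

def rope(x, base=10000.0):
    # x: (heads, seq, d_head). Rotate feature pairs by a position-dependent angle.
    _, seq, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)
    angles = np.arange(seq)[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gqa_attention(x, p, n_heads, n_kv_heads):
    seq, d_model = x.shape
    d_head = d_model // n_heads
    # Project, then split into heads: (heads, seq, d_head).
    q = (x @ p["wq"]).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ p["wk"]).reshape(seq, n_kv_heads, d_head).transpose(1, 0, 2)
    v = (x @ p["wv"]).reshape(seq, n_kv_heads, d_head).transpose(1, 0, 2)
    q, k = rope(q), rope(k)
    # GQA: each K/V head is shared by n_heads // n_kv_heads query heads.
    group = n_heads // n_kv_heads
    k, v = np.repeat(k, group, axis=0), np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    # Causal mask: position i may not attend to positions j > i.
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    out = softmax(scores) @ v                             # (heads, seq, d_head)
    return out.transpose(1, 0, 2).reshape(seq, d_model) @ p["wo"]

def swiglu_ffn(x, p):
    gate = x @ p["w_gate"]
    silu = gate / (1.0 + np.exp(-gate))                   # SiLU (swish) activation
    return (silu * (x @ p["w_up"])) @ p["w_down"]

def decoder_block(x, p, n_heads, n_kv_heads):
    # Steps 1-3: pre-norm, attention, residual add of the block input.
    h = x + gqa_attention(rms_norm(x, p["norm1"]), p, n_heads, n_kv_heads)
    # Steps 4-6: pre-norm, SwiGLU feed-forward, residual add of h.
    return h + swiglu_ffn(rms_norm(h, p["norm2"]), p)

def init_params(rng, d_model, n_heads, n_kv_heads):
    d_head = d_model // n_heads
    d_ff = 8 * d_model // 3                               # 8/3x expansion
    def w(rows, cols):
        return rng.normal(scale=rows ** -0.5, size=(rows, cols))
    return {
        "norm1": np.ones(d_model), "norm2": np.ones(d_model),
        "wq": w(d_model, d_model), "wk": w(d_model, n_kv_heads * d_head),
        "wv": w(d_model, n_kv_heads * d_head), "wo": w(d_model, d_model),
        "w_gate": w(d_model, d_ff), "w_up": w(d_model, d_ff),
        "w_down": w(d_ff, d_model),
    }

# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
d_model, n_heads, n_kv_heads, seq_len = 48, 6, 2, 7
params = init_params(rng, d_model, n_heads, n_kv_heads)
x = rng.normal(size=(seq_len, d_model))
y = decoder_block(x, params, n_heads, n_kv_heads)
print(y.shape)
```

Note that the second residual adds `h`, the sum produced by the first residual, rather than the original `x`. A full model would stack this function once per layer, each with its own parameters.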

Encoder vs Decoder vs Encoder-Decoder

Encoder-only (BERT): bidirectional attention, suited to classification, embeddings, and other understanding tasks
Decoder-only (GPT, Llama, Claude): causal (masked) attention, suited to generation
Encoder-Decoder (T5, the original Transformer): bidirectional encoder plus causal decoder, suited to translation and summarization

In 2026, decoder-only dominates because causal language modeling scales better and simpler architectures train more efficiently.
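
The difference between the three variants comes down largely to which positions each token is allowed to attend to. The sketch below expresses this as boolean masks (True means "may attend"); the helper names are hypothetical and the sequence lengths are arbitrary.

```python
import numpy as np

def bidirectional_mask(n):
    # Encoder-style: every position may attend to every position.
    return np.ones((n, n), dtype=bool)

def causal_mask(n):
    # Decoder-style: position i may attend only to positions j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def cross_attention_mask(n_target, n_source):
    # Encoder-decoder: each decoder position sees the full encoder output.
    return np.ones((n_target, n_source), dtype=bool)

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

An encoder-decoder model combines all three: a bidirectional mask inside the encoder, a causal mask inside the decoder, and an unmasked cross-attention from decoder to encoder.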

Key Papers

attention-is-all-you-need — The original 2017 paper
crystallization-transformer-architectures-2025 — 53-model convergence analysis

Limitations

See attention-mechanism > Limitations for known failure modes including O(n²) complexity, softmax saturation, and topological blind spots.

Talos Research Wiki
