Attention Is All You Need

The seminal 2017 paper by Vaswani et al. (Google) that introduced the Transformer architecture. It is one of the most-cited machine learning papers of the 2010s and the foundation of modern LLMs.

Details

Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Venue: 31st Conference on Neural Information Processing Systems (NIPS 2017; the conference was later renamed NeurIPS)
arXiv: 1706.03762
Submitted: June 12, 2017 (last revised August 2, 2023)

Key Contributions

Scaled dot-product attention (see attention-mechanism)
Multi-head attention: Several attention heads run in parallel on separately learned projections, so each can capture a different type of relationship
Sinusoidal positional encodings (see positional-encoding)
Encoder-decoder architecture with residual connections and layer normalization
No recurrence or convolution: First sequence transduction model to rely entirely on attention
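
The first and third contributions can be illustrated with a minimal sketch, assuming pure Python and plain lists of rows in place of tensors. It follows the paper's formulas, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V and PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) with the cosine on odd indices. The function names are illustrative, not taken from any library, and the sketch omits masking, batching, and the learned projections that multi-head attention adds.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V given as lists of rows.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v (illustrative layout).
    """
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    d_v = len(V[0])
    out = []
    for q in Q:
        # Scaled similarity of this query against every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return out


def sinusoidal_positional_encoding(max_len, d_model):
    """Sine on even dimensions, cosine on odd dimensions, as in the paper."""
    pe = []
    for pos in range(max_len):
        row = []
        for i in range(0, d_model, 2):  # i is the even dimension index (2i in the paper)
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        pe.append(row[:d_model])  # trim the extra entry when d_model is odd
    return pe
```

Because the softmax weights sum to 1, each output row is a convex combination of the value rows. A query that matches one key more closely than the others pulls the output toward that key's value.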

Original Architecture Choices

Each of these choices was reasonable at the time, but none survived unchanged into the 2026 consensus [3]. Each line reads original choice → common replacement:

Post-norm LayerNorm → Pre-norm RMSNorm
Sinusoidal encoding → RoPE
ReLU activation → SwiGLU
Full MHA → GQA (grouped-query attention; MQA is the single-KV-head special case)
Bias terms in all layers → Bias-free layers
Encoder-decoder → Decoder-only
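
The first row of this list (post-norm LayerNorm versus pre-norm RMSNorm) can be sketched as follows, assuming a single activation vector as a plain Python list and omitting the learned gain and bias parameters. The function names and the `sublayer` callable are hypothetical stand-ins for an attention or feed-forward sublayer, not any library's API.

```python
import math


def layer_norm(x, eps=1e-5):
    """LayerNorm without learned gain/bias: subtract the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-5):
    """RMSNorm without learned gain: divide by the root mean square, no centering."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def post_norm_block(x, sublayer):
    """Original ordering: normalize after the residual add, norm(x + sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def pre_norm_block(x, sublayer):
    """Later ordering: normalize the sublayer input only, x + sublayer(norm(x))."""
    y = sublayer(rms_norm(x))
    return [a + b for a, b in zip(x, y)]
```

In the pre-norm ordering the residual path carries x through untouched, which is visible when the sublayer outputs zeros: `pre_norm_block` returns x itself, while `post_norm_block` still returns a normalized copy.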

Impact

The paper launched the Transformer paradigm. As of 2026, every frontier LLM (GPT-5, Claude 4, Gemini 3, Llama 4, DeepSeek V4) is built on a Transformer variant. The paper has been cited more than 100,000 times.
