Attention Is All You Need
The seminal 2017 paper by Vaswani et al. (Google) that introduced the Transformer architecture. It is one of the most-cited machine learning papers of the 2010s and the architectural basis of modern LLMs.
Details
- Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
- Venue: Advances in Neural Information Processing Systems 30 (NIPS 2017; the conference has since been renamed NeurIPS)
- arXiv: 1706.03762
- Submitted: June 12, 2017 (last revised August 2, 2023)
Key Contributions
- Scaled dot-product attention (see attention-mechanism)
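- Multi-head attention: several attention heads run in parallel, each on its own learned projection of the queries, keys, and values, so the model can attend to different representation subspaces at once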
- Sinusoidal positional encodings (see positional-encoding)
- Encoder-decoder architecture with residual connections and layer normalization
- No recurrence or convolution: the first sequence transduction model to rely entirely on attention to compute representations of its input and output
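The first and third contributions above are compact enough to write out directly. The following is a minimal sketch in plain Python, not the authors' implementation: it follows the paper's formulas Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V and PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function names are illustrative, matrices are assumed to be lists of row vectors, and masking, learned projections, and the multi-head split are left out.

```python
import math


def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V given as lists of row vectors."""
    d_k = len(K[0])
    d_v = len(V[0])
    out = []
    for q in Q:
        # One score per key: dot product scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output row is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return out


def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed encodings: sine on even dimensions, cosine on odd dimensions."""
    pe = []
    for pos in range(max_len):
        row = []
        for j in range(d_model):
            # Dimensions 2i and 2i+1 share one frequency; j // 2 recovers i.
            angle = pos / (10000 ** (2 * (j // 2) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe
```

A quick sanity check on the sketch: when all keys are identical, every score is equal, the softmax weights are uniform, and each output row is the plain average of the value vectors.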
Original Architecture Choices
All of these were reasonable choices, but none survived unchanged into the 2026 consensus [3]. Each line reads original choice → common replacement:
- Post-norm LayerNorm → Pre-norm RMSNorm
- Sinusoidal encoding → RoPE
- ReLU activation → SwiGLU
- Full multi-head attention (MHA) → grouped-query or multi-query attention (GQA/MQA)
- Bias terms in all layers → Bias-free layers
- Encoder-decoder → Decoder-only
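To make the first row of the list concrete, here is a hedged sketch of the two normalization layouts in plain Python. The paper wraps each sub-layer as LayerNorm(x + Sublayer(x)) (post-norm), while the pre-norm layout normalizes the sub-layer input and leaves the residual path untouched. The function names are illustrative, `sublayer` stands in for attention or the feed-forward network, and the learned gain and bias parameters are omitted for brevity.

```python
import math


def layer_norm(x, eps=1e-5):
    # LayerNorm without learned gain/bias: center, then divide by the std.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-5):
    # RMSNorm without learned gain: divide by the root mean square, no centering.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def post_norm_block(x, sublayer):
    # Original layout: LayerNorm(x + Sublayer(x)).
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def pre_norm_block(x, sublayer):
    # Later layout: x + Sublayer(RMSNorm(x)); the residual path is not normalized.
    y = sublayer(rms_norm(x))
    return [a + b for a, b in zip(x, y)]
```

One visible difference in the sketch: with a sub-layer that outputs zeros, `pre_norm_block` returns its input unchanged, whereas `post_norm_block` still rescales it. That untouched identity path is a commonly cited reason pre-norm stacks are easier to train at depth.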
Impact
The paper launched the transformer paradigm. As of 2026, every frontier LLM (GPT-5, Claude 4, Gemini 3, Llama 4, DeepSeek V4) is built on transformer variants. The paper has >100,000 citations.