Attention Is All You Need

The seminal 2017 paper by Vaswani et al. (Google) that introduced the Transformer architecture. It is one of the most-cited machine learning papers of the 2010s and the architectural foundation of modern LLMs.

Details

  • Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
  • Venue: Advances in Neural Information Processing Systems 30 (NIPS 2017; the conference has since been renamed NeurIPS)
  • arXiv: 1706.03762
  • Submitted: June 12, 2017 (last revised August 2, 2023)

Key Contributions

  1. Scaled dot-product attention (see attention-mechanism)
  2. Multi-head attention: Parallel heads over separately learned projections (h = 8 in the base model) let the model attend to different representation subspaces at once
  3. Sinusoidal positional encodings (see positional-encoding)
  4. Encoder-decoder architecture with residual connections and layer normalization
  5. Complete displacement of recurrence: First sequence transduction model to rely entirely on attention, with no recurrence or convolution
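
Contributions 1 and 3 are compact enough to sketch directly. The block below is a minimal illustration in plain Python, not the paper's implementation: it computes softmax(QK^T / sqrt(d_k)) V for a single unbatched, unmasked head, and builds the sinusoidal table PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function names and the list-of-lists representation are assumptions made for readability; real implementations use batched tensor operations.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v (lists of lists).
    Single head, no masking, no batching: an illustrative sketch only.
    """
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    d_v = len(V[0])
    out = []
    for q in Q:
        # Scaled similarity between this query and every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Output row is the attention-weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(d_v)])
    return out

def sinusoidal_encoding(n_positions, d_model):
    """Build the n_positions x d_model sinusoidal position table.

    Even columns hold sin, odd columns hold cos, and each sin/cos pair
    shares the wavelength 10000^(2i/d_model) with i = column // 2.
    """
    table = []
    for pos in range(n_positions):
        row = []
        for col in range(d_model):
            i = col // 2
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle) if col % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Toy usage: one query that matches the first key more closely.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
attended = scaled_dot_product_attention(Q, K, V)
positions = sinusoidal_encoding(4, 6)
```

In the toy usage the query is closer to the first key, so the output leans toward the first value row. Multi-head attention (contribution 2) would run several such heads on separately projected copies of Q, K, and V and concatenate the results.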

Original Architecture Choices

Each of these choices was reasonable at the time, but none survived unchanged into the 2026 consensus [3]:

  • Post-norm LayerNorm → Pre-norm RMSNorm
  • Sinusoidal encoding → RoPE
  • ReLU activation → SwiGLU
  • Full MHA → GQA or MQA
  • Bias terms in all layers → Bias-free layers
  • Encoder-decoder → Decoder-only
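
The first substitution in the list changes two things at once: where normalization sits relative to the residual connection, and which normalization is used. The sketch below contrasts the original post-norm ordering, LayerNorm(x + Sublayer(x)), with a pre-norm ordering using RMSNorm, x + Sublayer(RMSNorm(x)). It is a simplified illustration in plain Python: the learned gain and bias parameters are omitted, the function names are assumptions, and toy_sublayer is a hypothetical stand-in for an attention or feed-forward sublayer.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned gain or bias: center, then divide by std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """RMSNorm without learned gain: divide by root-mean-square, no centering."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def post_norm_block(x, sublayer):
    """Original ordering: normalize after adding the residual."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def pre_norm_block(x, sublayer):
    """Later ordering: normalize the sublayer input, leave the residual path as-is."""
    y = sublayer(rms_norm(x))
    return [a + b for a, b in zip(x, y)]

def toy_sublayer(v):
    """Hypothetical stand-in for attention or a feed-forward layer."""
    return [0.5 * t for t in v]

x = [1.0, 2.0, 3.0, 4.0]
post = post_norm_block(x, toy_sublayer)  # re-centered and rescaled every block
pre = pre_norm_block(x, toy_sublayer)    # x passes through the residual untouched
```

In the pre-norm form the input x reaches the output through an un-normalized residual path, which is the property usually credited with making deep stacks easier to train.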

Impact

The paper launched the transformer paradigm. As of 2026, every frontier LLM (GPT-5, Claude 4, Gemini 3, Llama 4, DeepSeek V4) is built on transformer variants. The paper has >100,000 citations.