Feed-Forward Networks
Every transformer block contains a feed-forward network (FFN) that processes each token position independently. The FFN expands the representation into a higher-dimensional space, applies a non-linear transformation, then projects back down.
Original: ReLU FFN
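The 2017 transformer used a two-layer FFN with ReLU:
$$\mathrm{FFN}(x) = \max(0,\, x W_1 + b_1)\, W_2 + b_2$$
The hidden dimension was 4× the model dimension (512 → 2048 in the original base model; at a larger scale, e.g., 4096 → 16384).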
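As a rough illustration of the expand, activate, project pattern, here is a minimal pure-Python sketch of a position-wise ReLU FFN. The helper names (`matvec`, `relu_ffn`, `random_matrix`), the tiny dimensions, and the random initialization are assumptions made for readability, not taken from any particular implementation.

```python
import random

def matvec(x, W):
    """Multiply a row vector x (length n) by a matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def relu_ffn(x, W1, b1, W2, b2):
    """Two-layer FFN for one token vector: max(0, x W1 + b1) W2 + b2."""
    hidden = [max(0.0, h + b) for h, b in zip(matvec(x, W1), b1)]  # expand + ReLU
    return [o + b for o, b in zip(matvec(hidden, W2), b2)]         # project back

def random_matrix(rows, cols, rng):
    """Hypothetical helper: small Gaussian weights, for illustration only."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

d_model = 8            # illustrative size, far smaller than a real model
d_ff = 4 * d_model     # 4x expansion
rng = random.Random(0)
W1, b1 = random_matrix(d_model, d_ff, rng), [0.0] * d_ff
W2, b2 = random_matrix(d_ff, d_model, rng), [0.0] * d_model

# A "sequence" of three token vectors; the same FFN is applied to each one
# separately, so no information flows between positions in this step.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(3)]
outputs = [relu_ffn(t, W1, b1, W2, b2) for t in tokens]
```

Because each token is handled by its own call, reordering `tokens` only reorders `outputs`. A real implementation would batch all positions into a single matrix multiplication.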
2026 Consensus: SwiGLU
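SwiGLU (Swish-Gated Linear Unit) has replaced ReLU as the standard activation. It uses three weight matrices with a gating mechanism:
$$\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = \left(\mathrm{Swish}(x W_{\text{gate}}) \odot x W_{\text{up}}\right) W_{\text{down}}$$
where $\mathrm{Swish}(z) = z \cdot \sigma(z)$ is the Swish activation ($\sigma$ is the logistic sigmoid) and $\odot$ is element-wise multiplication.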
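The gating computation can be sketched in a few lines of pure Python. The matrix names `W_gate`, `W_up`, and `W_down` are labels chosen here for illustration (implementations name them differently), biases are omitted as is common for this layer, and Swish is taken with β = 1 (also known as SiLU).

```python
import math

def matvec(x, W):
    """Multiply a row vector x (length n) by a matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swish(z):
    """Swish with beta = 1 (SiLU): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """Gated FFN for one token vector: (Swish(x W_gate) * x W_up) W_down."""
    gate = [swish(g) for g in matvec(x, W_gate)]  # values that scale each hidden unit
    up = matvec(x, W_up)                          # plain linear expansion
    hidden = [g * u for g, u in zip(gate, up)]    # element-wise gating
    return matvec(hidden, W_down)                 # project back to model dimension
```

When a gate pre-activation is strongly negative, `swish` returns a value near zero and the corresponding hidden unit is suppressed; when it is large and positive, the unit passes through almost unchanged.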
Why SwiGLU won [3]:
- The gating structure acts as an adaptive information filter — it learns to selectively pass or block information
- Produces 25-40% better quality-per-FLOP despite having fewer effective parameters
- The expansion ratio is typically 8/3× (not 4×), so that the three weight matrices hold roughly as many parameters as the two matrices of a 4× ReLU FFN
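The 8/3× figure follows from matching parameter counts: two matrices at 4× give 2 · d · 4d = 8d² weights, and three matrices at 8/3× give 3 · d · (8/3)d = 8d² as well. The sketch below checks this for d = 4096; the function names are hypothetical and biases are ignored.

```python
def relu_ffn_params(d_model, ratio=4):
    """Weight count of a two-matrix FFN (biases ignored)."""
    d_ff = ratio * d_model
    return 2 * d_model * d_ff

def swiglu_ffn_params(d_model, d_ff):
    """Weight count of a three-matrix gated FFN (biases ignored)."""
    return 3 * d_model * d_ff

d_model = 4096
d_ff_swiglu = 8 * d_model // 3  # integer floor of (8/3) * d_model

relu_count = relu_ffn_params(d_model)
swiglu_count = swiglu_ffn_params(d_model, d_ff_swiglu)
relative_gap = abs(relu_count - swiglu_count) / relu_count

print(relu_count, swiglu_count, relative_gap)
```

The two counts differ only because 8/3 · 4096 is not an integer. In practice the hidden size is usually rounded to a hardware-friendly multiple, so the match is approximate rather than exact.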
Variants
- GELU: Smooth approximation of ReLU, used in BERT and GPT-3 (the common choice before SwiGLU)
- ReLU²: Squared ReLU, introduced in Primer and used in some newer models (e.g., Nemotron-4)
- GLU variants: SwiGLU, GeGLU (used in Gemma), ReGLU; SwiGLU empirically strongest [3]
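The variants differ only in the scalar non-linearity and in whether a gate is used. The sketch below writes each one as a plain Python function on a single pre-activation value; the function names are chosen for illustration, `gelu` uses the exact erf form rather than the tanh approximation, and `swish` assumes β = 1.

```python
import math

def relu(z):
    return max(0.0, z)

def relu_squared(z):
    """ReLU²: zero for negative inputs, z squared otherwise."""
    return max(0.0, z) ** 2

def gelu(z):
    """Exact GELU: z times the standard normal CDF of z."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def swish(z):
    """Swish with beta = 1 (SiLU): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def glu_variant(activation, gate_pre, up_pre):
    """One hidden unit of a gated FFN: activation(gate) * up."""
    return activation(gate_pre) * up_pre

# The three GLU variants share the same structure and swap only the activation.
reglu = glu_variant(relu, 2.0, 3.0)
geglu = glu_variant(gelu, 2.0, 3.0)
swiglu = glu_variant(swish, 2.0, 3.0)
```

In a full layer, `gate_pre` and `up_pre` would be the outputs of two separate linear projections of the same token vector.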
References
- Tan — Crystallization of Transformer Architectures (SwiGLU adoption analysis)
- Vaswani et al. — Attention Is All You Need (original ReLU FFN)