Feed-Forward Networks

Every transformer block contains a feed-forward network (FFN) that processes each token position independently. The FFN expands the representation into a higher-dimensional space, applies a non-linear transformation, then projects back down.

Original: ReLU FFN

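The 2017 transformer used a two-layer FFN with ReLU:

$$\mathrm{FFN}(x) = \max(0,\, x W_1 + b_1)\, W_2 + b_2$$

The hidden dimension was 4× the model dimension (512 → 2048 in the original base model; a 4096-dimensional model would use 16384).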
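As a rough illustration of the expand, activate, project pattern, here is a minimal NumPy sketch of a ReLU FFN. The toy sizes, the random initialization, and the function name `relu_ffn` are assumptions made for the example, not details of any particular implementation.

```python
import numpy as np

def relu_ffn(x, w1, b1, w2, b2):
    """Position-wise FFN: expand, apply ReLU, project back down."""
    hidden = np.maximum(0.0, x @ w1 + b1)  # (seq_len, d_ff)
    return hidden @ w2 + b2                # (seq_len, d_model)

# Toy sizes for illustration only; d_ff = 4 * d_model.
d_model, d_ff, seq_len = 8, 32, 5
rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.1, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
w2 = rng.normal(scale=0.1, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))  # one row per token position
y = relu_ffn(x, w1, b1, w2, b2)

# Positions do not interact: running a single row on its own matches
# the corresponding row of the full output.
y_row2 = relu_ffn(x[2:3], w1, b1, w2, b2)
same = np.allclose(y[2:3], y_row2)
```

The final check is what "processes each token position independently" means in practice: the same weights are applied to every row, and no row sees any other.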

2026 Consensus: SwiGLU

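SwiGLU (Swish-Gated Linear Unit) has replaced ReLU as the standard activation. It uses three weight matrices with a gating mechanism:

$$\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = \big(\mathrm{Swish}(x W_{\text{gate}}) \odot x W_{\text{up}}\big)\, W_{\text{down}}$$

Here $\mathrm{Swish}(z) = z \cdot \sigma(z)$ is the Swish activation, with $\sigma$ the logistic sigmoid, and $\odot$ is element-wise multiplication.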
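The gated FFN can be sketched in a few lines of NumPy. This is a minimal version that assumes bias-free projections and Swish with β = 1; the names `w_gate`, `w_up`, `w_down` and the toy sizes are illustrative choices, not taken from any specific model.

```python
import numpy as np

def swish(z):
    """Swish (SiLU) with beta = 1: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated FFN: the Swish-activated branch scales the linear branch."""
    gate = swish(x @ w_gate)      # (seq_len, d_ff), decides how much passes
    up = x @ w_up                 # (seq_len, d_ff), the content being gated
    return (gate * up) @ w_down   # element-wise product, then project down

# Toy sizes for illustration only; d_ff = 8/3 * d_model.
d_model, d_ff, seq_len = 12, 32, 5
rng = np.random.default_rng(0)
w_gate = rng.normal(scale=0.1, size=(d_model, d_ff))
w_up = rng.normal(scale=0.1, size=(d_model, d_ff))
w_down = rng.normal(scale=0.1, size=(d_ff, d_model))

x = rng.normal(size=(seq_len, d_model))
y = swiglu_ffn(x, w_gate, w_up, w_down)
```

Wherever the gate branch is close to zero, the matching hidden unit contributes almost nothing to the output, regardless of the value in the linear branch.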

Why SwiGLU won [3]:

  • The gating structure acts as an adaptive information filter — it learns to selectively pass or block information
  • Produces 25–40% better quality per FLOP despite having fewer effective parameters
  • The expansion ratio is typically 8/3× (not 4×) to compensate for the extra gate parameters
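The 8/3× figure follows from matching weight counts: two matrices at 4× cost 2 · d · 4d = 8d² weights, and three matrices at 8/3× cost 3 · d · (8/3)d = 8d². The short check below works this through for d = 4096, ignoring biases; the helper `ffn_params` is a hypothetical name used only for this example.

```python
def ffn_params(d_model, d_ff, n_matrices):
    """Weight count for an FFN built from n_matrices of size d_model x d_ff (biases ignored)."""
    return n_matrices * d_model * d_ff

d_model = 4096

# ReLU FFN: two matrices, hidden size 4 * d_model.
relu_params = ffn_params(d_model, 4 * d_model, n_matrices=2)

# SwiGLU FFN: three matrices, hidden size 8/3 * d_model (rounded down here).
swiglu_d_ff = 8 * d_model // 3
swiglu_params = ffn_params(d_model, swiglu_d_ff, n_matrices=3)

ratio = swiglu_params / relu_params  # very close to 1.0
```

The two counts differ only by the integer rounding of the hidden size. Real implementations often round that size to a hardware-friendly multiple instead, so exact values vary by model.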

Variants

  • GELU: smooth approximation of ReLU, used in BERT and GPT-3 (the common choice before SwiGLU)
  • ReLU²: squared ReLU, used in some newer models (e.g., Nemotron-4)
  • GLU variants: SwiGLU, GeGLU (used in Gemma), and ReGLU, with SwiGLU empirically strongest [3]
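The variants listed above differ only in the element-wise function applied, which a small sketch can make concrete. The code assumes the common tanh approximation of GELU and bias-free projections; `glu_ffn`, `GLU_VARIANTS`, and the toy sizes are names and values invented for this illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_squared(z):
    """ReLU²: zero for negative inputs, z**2 otherwise."""
    return np.maximum(0.0, z) ** 2

def gelu_tanh(z):
    """GELU via the widely used tanh approximation."""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def swish(z):
    return z / (1.0 + np.exp(-z))

def glu_ffn(x, w_gate, w_up, w_down, activation):
    """Generic gated FFN; the activation on the gate branch names the variant."""
    return (activation(x @ w_gate) * (x @ w_up)) @ w_down

GLU_VARIANTS = {"ReGLU": relu, "GeGLU": gelu_tanh, "SwiGLU": swish}

# Toy sizes for illustration only.
d_model, d_ff, seq_len = 12, 32, 4
rng = np.random.default_rng(0)
w_gate = rng.normal(scale=0.1, size=(d_model, d_ff))
w_up = rng.normal(scale=0.1, size=(d_model, d_ff))
w_down = rng.normal(scale=0.1, size=(d_ff, d_model))
x = rng.normal(size=(seq_len, d_model))

outputs = {name: glu_ffn(x, w_gate, w_up, w_down, act)
           for name, act in GLU_VARIANTS.items()}
```

GELU and ReLU² used on their own (without gating) slot into the two-matrix FFN in place of ReLU; only the GLU family adds the third matrix.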

References

  • Tan — Crystallization of Transformer Architectures (SwiGLU adoption analysis)
  • Vaswani et al. — Attention Is All You Need (original ReLU FFN)