Feed-Forward Networks

Every transformer block contains a feed-forward network (FFN) that processes each token position independently. The FFN expands the representation into a higher-dimensional space, applies a non-linear transformation, then projects back down.

Original: ReLU FFN

The 2017 transformer used a two-layer FFN with ReLU:

$\mathrm{FFN}(x) = \max(0,\, x W_{1} + b_{1})\, W_{2} + b_{2}$

The hidden dimension was 4× the model dimension: 512 → 2048 in the original base model, which corresponds to 4096 → 16384 at a model dimension of 4096.
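The formula above can be sketched in plain Python to make the expand, activate, project-back pattern concrete. This is a minimal illustration and not an optimized implementation: the helper names (`matmul_vec`, `relu_ffn`, `random_matrix`), the toy size `d_model = 8`, and the random initialization are all assumptions made for the example.

```python
import random

def matmul_vec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def relu_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single token vector x."""
    hidden = [max(0.0, h + b) for h, b in zip(matmul_vec(x, W1), b1)]
    return [o + b for o, b in zip(matmul_vec(hidden, W2), b2)]

def random_matrix(rows, cols, rng):
    """Small random weights; the scale 0.1 is arbitrary."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model = 8                  # toy model dimension (illustrative)
d_ff = 4 * d_model           # 4x expansion, as in the original design

W1, b1 = random_matrix(d_model, d_ff, rng), [0.0] * d_ff
W2, b2 = random_matrix(d_ff, d_model, rng), [0.0] * d_model

# Three token positions; each is transformed on its own, with shared weights.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(3)]
outputs = [relu_ffn(t, W1, b1, W2, b2) for t in tokens]
```

Because `relu_ffn` only ever sees one token vector, the output at a position cannot depend on any other position, which is the independence property described at the top of this page.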

2026 Consensus: SwiGLU

SwiGLU (Swish-Gated Linear Unit) has replaced the ReLU FFN as the standard feed-forward design. It uses three weight matrices instead of two: two parallel up-projections, one of which gates the other, followed by a down-projection:

$\mathrm{SwiGLU}(x) = \left( (x W_{1}) \odot \mathrm{Swish}(x W_{2}) \right) W_{3}$

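Here $\mathrm{Swish}(x) = x \cdot \sigma(x)$ is the Swish activation (also called SiLU), $\sigma$ is the logistic sigmoid, and $\odot$ is element-wise multiplication. $W_{2}$ produces the gate, $W_{1}$ produces the values the gate scales, and $W_{3}$ projects the result back to the model dimension.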
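A minimal sketch of the gated computation for a single token vector, in plain Python. The function names and the tiny hand-picked matrices are assumptions made for illustration. The matrix roles follow the formula above: `W1` for values, `W2` for the gate, `W3` for the down-projection.

```python
import math

def matmul_vec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def swish(z):
    """Swish(z) = z * sigmoid(z)."""
    return z * sigmoid(z)

def swiglu_ffn(x, W1, W2, W3):
    """((x W1) * Swish(x W2)) W3, with * taken element-wise."""
    value = matmul_vec(x, W1)                      # values to be filtered
    gate = [swish(z) for z in matmul_vec(x, W2)]   # per-unit gate
    gated = [v * g for v, g in zip(value, gate)]
    return matmul_vec(gated, W3)                   # project back down

x = [1.0, 2.0]
W1 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]    # value branch: x W1 = [1, 2, 3]
W3 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # down-projection

# Gate pre-activations of zero: Swish(0) = 0, so every hidden unit is blocked.
closed = swiglu_ffn(x, W1, [[0.0] * 3, [0.0] * 3], W3)

# Gate pre-activations of one: every hidden unit is scaled by Swish(1).
opened = swiglu_ffn(x, W1, [[1.0] * 3, [0.0] * 3], W3)
```

With the zero gate, `closed` is all zeros regardless of the value branch. With the second gate, each value passes through scaled by Swish(1), which is about 0.73.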

Why SwiGLU won [3]:

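The gate acts as a learned, input-dependent filter: each hidden unit's value is scaled by a gate computed from the same input, so the network can pass or suppress information per token
It reportedly yields 25-40% better quality per FLOP despite having fewer effective parameters [3]
The expansion ratio is typically 8/3× rather than 4×, so that the three matrices cost about as many parameters as the two matrices of a 4× ReLU FFN: $3 \cdot d \cdot \tfrac{8}{3} d = 2 \cdot d \cdot 4d = 8 d^{2}$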
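The parameter arithmetic behind the 8/3× ratio can be checked with a few lines of Python. This sketch ignores bias terms and uses a model dimension of 4096 as an example. The function names are made up for illustration, and the hidden size is rounded down to an integer here; real models may round it differently.

```python
def relu_ffn_params(d_model, d_ff):
    """Two matrices: d_model x d_ff and d_ff x d_model (biases ignored)."""
    return 2 * d_model * d_ff

def swiglu_ffn_params(d_model, d_ff):
    """Three matrices: two up-projections and one down-projection."""
    return 3 * d_model * d_ff

d_model = 4096
relu_hidden = 4 * d_model             # 4x expansion
swiglu_hidden = (8 * d_model) // 3    # 8/3x expansion, rounded down

relu_total = relu_ffn_params(d_model, relu_hidden)
swiglu_total = swiglu_ffn_params(d_model, swiglu_hidden)
swiglu_at_4x = swiglu_ffn_params(d_model, relu_hidden)

print(relu_hidden, relu_total)        # 16384 134217728
print(swiglu_hidden, swiglu_total)    # 10922 134209536
print(swiglu_at_4x)                   # 201326592
```

At 8/3× the SwiGLU block lands within a rounding error of the 4× ReLU block, whereas keeping the 4× ratio with three matrices would cost 50% more parameters.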

Variants

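GELU: a smooth approximation of ReLU, used in BERT and GPT-3; the common choice before SwiGLU
ReLU²: squared ReLU, $\max(0, x)^{2}$, used in some newer models (e.g., Nemotron-4)
GLU variants: SwiGLU, GeGLU (used in Gemma), and ReGLU, which differ only in the activation applied to the gate; SwiGLU is reported as empirically strongest [3]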
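The variants listed above differ only in the scalar activation and in whether a gate is used. The sketch below defines each activation in plain Python and a generic gating helper. The name `glu_gate` and the sample inputs are assumptions made for illustration, and GELU is written in its exact erf form rather than the tanh approximation some implementations use.

```python
import math

def relu(z):
    return max(0.0, z)

def relu_squared(z):
    """ReLU²: max(0, z) squared."""
    return max(0.0, z) ** 2

def gelu(z):
    """Exact GELU: z * Phi(z), with Phi the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def swish(z):
    """Swish / SiLU: z * sigmoid(z), written to avoid overflow."""
    if z >= 0:
        return z / (1.0 + math.exp(-z))
    e = math.exp(z)
    return z * e / (1.0 + e)

def glu_gate(values, gate_pre, activation):
    """Element-wise gating: values * activation(gate pre-activations)."""
    return [v * activation(g) for v, g in zip(values, gate_pre)]

values = [2.0, 2.0]
gate_pre = [-1.0, 3.0]

reglu = glu_gate(values, gate_pre, relu)     # ReGLU-style gating
geglu = glu_gate(values, gate_pre, gelu)     # GeGLU-style gating
swiglu = glu_gate(values, gate_pre, swish)   # SwiGLU-style gating
```

In a full FFN, `values` and `gate_pre` would be two linear projections of the same input, and the gated result would be multiplied by a down-projection matrix.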

References

Tan — Crystallization of Transformer Architectures (SwiGLU adoption analysis)
Vaswani et al. — Attention Is All You Need (original ReLU FFN)

Talos Research Wiki