Position-Wise Feed-Forward Layers: FFN Formula, Parameter Budget, and GeLU vs ReLU

Category: architecture Updated: 2026-02-27

Each transformer FFN layer computes max(0, xW₁ + b₁)W₂ + b₂ with d_ff = 2048 (4× d_model = 512); FFN sublayers hold ~2.1M parameters per layer, roughly two-thirds of each layer's parameter budget and ~39% of the base model's 65M total; GeLU outperforms ReLU on NLP benchmarks (Hendrycks & Gimpel, 2016).

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| FFN formula | FFN(x) = max(0, xW₁ + b₁)W₂ + b₂ | | ReLU activation; GeLU variant: FFN(x) = GELU(xW₁ + b₁)W₂ + b₂ |
| d_model (input/output dimension) | 512 | dimensions | FFN input and output match d_model for residual connections |
| d_ff (inner dimension) | 2048 | dimensions | 4× d_model; chosen empirically; expands and then compresses the representation |
| W₁ parameters (per layer) | 512 × 2048 + 2048 = 1,050,624 | parameters | Weights + biases for the expansion layer |
| W₂ parameters (per layer) | 2048 × 512 + 512 = 1,049,088 | parameters | Weights + biases for the compression layer |
| Total FFN parameters per layer | 2,099,712 | parameters | ~2.1M per encoder or decoder layer; vs ~1.05M for the attention block |
| FFN share of base model parameters | ~39% | percent | 12 FFN sublayers × 2.1M ≈ 25.2M of 65M; ~67% of each layer's parameters |
| GeLU vs ReLU (CIFAR-10 error) | 7.89% vs 8.16% | error rate | GeLU achieves lower error; Hendrycks & Gimpel (2016) Table 1 |

The position-wise feed-forward network (FFN) is the second major sublayer in every transformer encoder and decoder layer. Applied independently to each token position after the attention sublayer, it provides the non-linear capacity that multi-head attention — which applies only linear transformations to value vectors — cannot supply alone.

The Formula

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

where:

  • x ∈ ℝ^{d_model} is the d_model=512 dimensional input for one token position
  • W₁ ∈ ℝ^{d_model × d_ff} = ℝ^{512 × 2048} — expands the representation
  • W₂ ∈ ℝ^{d_ff × d_model} = ℝ^{2048 × 512} — compresses back to d_model
  • max(0, ·) is ReLU; the same sublayer with GeLU is GELU(xW₁ + b₁)W₂ + b₂

The FFN is applied identically and independently to each of the n token positions; it does not mix information across positions. The network is “position-wise” in the sense that it acts at each position independently, which is why the original paper describes it as two convolutions with kernel size 1.
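As a concrete sketch, the sublayer is just two matrix multiplications with a ReLU in between, applied row by row. The NumPy version below uses the base-model dimensions; the weights are random placeholders, not trained values:

```python
import numpy as np

d_model, d_ff = 512, 2048                    # base-model dimensions
rng = np.random.default_rng(0)

W1 = rng.normal(0, 0.02, (d_model, d_ff))    # expansion: 512 -> 2048
b1 = np.zeros(d_ff)
W2 = rng.normal(0, 0.02, (d_ff, d_model))    # compression: 2048 -> 512
b2 = np.zeros(d_model)

def ffn(x):
    """Position-wise FFN: each row (token position) is transformed independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # ReLU between the two projections

x = rng.normal(size=(10, d_model))           # 10 token positions
y = ffn(x)
assert y.shape == (10, d_model)              # output matches d_model for the residual add

# "Position-wise": permuting the token rows just permutes the outputs.
perm = rng.permutation(10)
assert np.allclose(ffn(x[perm]), y[perm])
```

The shape check makes the residual-connection constraint visible: the output must stay d_model-dimensional so it can be added back to the input.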

Parameter Breakdown Across Architectures

| Hyperparameter | Base Model | Big Model | Modern 4× rule |
|---|---|---|---|
| d_model | 512 | 1024 | varies |
| d_ff | 2048 | 4096 | 4 × d_model |
| d_ff / d_model ratio | 4 | 4 | 2.67× to 8× (4× common) |
| W₁ parameters (per FFN) | 1,048,576 | 4,194,304 | |
| W₂ parameters (per FFN) | 1,048,576 | 4,194,304 | |
| Biases (per FFN) | 2,560 | 5,120 | |
| Total per FFN sublayer | ~2.1M | ~8.4M | |

Where Do the Parameters Go? (Base Model, 65M Total)

| Component | Layers | Params per layer | Total |
|---|---|---|---|
| Token embeddings (vocab = 37,000) | | | ~18.9M |
| Attention blocks (enc + dec) | 12 | ~1.05M | ~12.6M |
| FFN sublayers (enc + dec) | 12 | ~2.1M | ~25.2M |
| LayerNorm + output projection | | | ~8.3M |
| Total | | | ~65M |

FFN sublayers alone account for roughly 39% of the base model’s parameters; adding the attention blocks’ ~19%, the 6+6 transformer layers hold ~58% of the total, with most of the remainder in the embedding matrices. Within each layer, approximately two-thirds of the parameters sit in the FFN.
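The parameter arithmetic above can be reproduced directly; the counts follow the weight-plus-bias formulas in the tables, and 65M is the quoted base-model total:

```python
def ffn_params(d_model, d_ff):
    """Weights + biases for one position-wise FFN sublayer."""
    expand = d_model * d_ff + d_ff       # W1 + b1
    compress = d_ff * d_model + d_model  # W2 + b2
    return expand + compress

base = ffn_params(512, 2048)             # base model
big = ffn_params(1024, 4096)             # big model

print(base)                              # 2099712, the ~2.1M per-layer figure
print(big)                               # 8393728, the ~8.4M per-layer figure
print(round(12 * base / 65e6, 2))        # 0.39, the ~39% share of the 65M total
```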

GeLU vs ReLU

The original transformer uses ReLU. Later architectures switched to GeLU (Hendrycks & Gimpel, 2016), which is defined as:

GeLU(x) = x · Φ(x)

where Φ(x) is the standard Gaussian CDF. Unlike ReLU, GeLU applies a smooth, probabilistic gate that decreases output for negative inputs rather than zeroing them entirely.

| Activation | CIFAR-10 error | CIFAR-100 error | Characteristic |
|---|---|---|---|
| ReLU | 8.16% | 21.77% | Hard threshold at 0; sparse activations |
| ELU | 8.41% | 22.98% | Smooth for negative inputs |
| GeLU | 7.89% | 20.74% | Smooth gate; weights by magnitude |
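A minimal comparison of the two activations, using the exact GeLU definition x·Φ(x) via the error function:

```python
import math

def gelu(x):
    """Exact GeLU: x * Phi(x), with Phi the standard normal CDF (via erf)."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

# ReLU zeroes negative inputs outright; GeLU attenuates them smoothly instead.
print(relu(-0.5), round(gelu(-0.5), 4))   # 0.0 vs -0.1543
print(relu(2.0), round(gelu(2.0), 4))     # 2.0 vs 1.9545
```

For large positive inputs GeLU approaches the identity (Φ(x) → 1), so it behaves like ReLU there; the difference is concentrated around zero, where GeLU is smooth and slightly negative.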

Shazeer (2020) further extended this with Gated Linear Units (GLU), where the FFN becomes:

FFN_GLU(x) = (xW₁ ⊙ σ(xW_gate)) W₂

This variant and its gated relatives, GEGLU (GeLU gate) and SwiGLU (Swish gate), are widely used in modern architectures for improved quality and training stability.
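A sketch of the Swish-gated form in NumPy. The dimensions are illustrative: following Shazeer (2020), gated variants typically drop the biases and shrink d_ff to roughly 2/3 of the ungated value so the parameter count stays comparable (hence the 1365 below, ~2/3 of 2048):

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))        # Swish-1 / SiLU gate

def ffn_swiglu(x, W1, W_gate, W2):
    """SwiGLU FFN: (Swish(x W_gate) * (x W1)) W2 -- two input projections, one gating."""
    return (swish(x @ W_gate) * (x @ W1)) @ W2

d_model = 512
d_ff = 1365                              # ~2/3 of 2048 keeps the budget near the ungated FFN
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.02, (d_model, d_ff))
W_gate = rng.normal(0, 0.02, (d_model, d_ff))
W2 = rng.normal(0, 0.02, (d_ff, d_model))

y = ffn_swiglu(rng.normal(size=(4, d_model)), W1, W_gate, W2)
assert y.shape == (4, d_model)           # still d_model in, d_model out
```

Note the trade-off: the gated variant carries three weight matrices instead of two, which is exactly why d_ff is reduced to hold the total fixed.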

See multi-head-attention for the other parameter-dense sublayer in each layer, self-attention-mechanism for the attention formula, and transformer-architecture for how FFN and attention sublayers are composed with residual connections and layer normalization.

Frequently Asked Questions

Why is d_ff set to 4× d_model in the original transformer?

The 4× ratio (d_ff=2048 for d_model=512) was chosen empirically by Vaswani et al. It provides sufficient capacity for the FFN to perform complex non-linear transformations of each token's representation after attention. In practice, d_ff ratios from 2.67× to 8× are used across modern architectures, with the 4× ratio remaining a common default.
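To make that range concrete, here is a quick weight-only count (biases omitted) for a few ratios at d_model = 512; the 2.67× entry corresponds to the 8/3 ratio used by some gated-FFN models:

```python
d_model = 512

def ffn_weights(d_model, ratio):
    """Weight-only parameter count for a two-matrix FFN at a given expansion ratio."""
    d_ff = int(ratio * d_model)
    return 2 * d_model * d_ff

for ratio in (8 / 3, 4, 8):
    d_ff = int(ratio * d_model)
    print(f"{ratio:.2f}x -> d_ff={d_ff}, weights={ffn_weights(d_model, ratio):,}")
```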

What is the role of the FFN layer if attention already mixes token information?

Multi-head attention mixes information across token positions, but applies only a linear transformation to each token's value vector. The position-wise FFN applies an independent non-linear transformation to each token's representation individually. It is thought to act as a key-value memory (Geva et al., 2021), storing factual associations learned during training.

Why does GeLU outperform ReLU in transformer architectures?

GeLU (x·Φ(x), where Φ is the standard normal CDF) is a smooth function that weights inputs by their magnitude rather than applying a hard threshold at zero. This smoother activation landscape tends to produce better-conditioned gradients during training on language tasks. Hendrycks & Gimpel (2016) showed consistent improvements over ReLU across NLP, vision, and speech benchmarks.
