Layer Normalization: Formula, Pre-Norm vs Post-Norm, and Training Stability

Category: architecture Updated: 2026-02-27

Layer normalization normalizes across d_model features per token: y = γ·(x−μ)/σ + β; applied before each sublayer in pre-norm transformers; enables stable training of 100+ layer networks (Ba et al., 2016).

Key Data Points
| Measure | Value | Unit | Notes |
| --- | --- | --- | --- |
| Normalization formula | y = γ·(x−μ)/σ + β | — | μ = mean over d_model features; σ = std dev; γ, β learnable per dimension |
| Normalization axis | d_model features | — | Statistics computed within each token independently; not across batch or sequence |
| Learnable parameters per LayerNorm | 2 × d_model | parameters | γ ∈ ℝ^{d_model} and β ∈ ℝ^{d_model}; initialized γ=1, β=0 |
| Pre-norm training speed advantage | ~2× | relative convergence | Xiong et al. (2020): pre-norm converges faster and is more stable than post-norm |
| Parameters for base transformer | 30 × 2 × 512 = 30,720 | parameters | 30 LayerNorm instances (2 per encoder layer × 6, 3 per decoder layer × 6), 2 × d_model each |

Layer normalization, introduced by Ba et al. (2016), stabilizes the activations within each layer of a neural network by normalizing across the feature dimension. In transformers, it is applied after (post-norm) or before (pre-norm) each sublayer, enabling stable training of very deep networks.

The Formula

For an input vector x ∈ ℝ^{d_model}:

μ = (1/d_model) Σᵢ xᵢ

σ² = (1/d_model) Σᵢ (xᵢ − μ)²

y = γ ⊙ (x − μ) / (σ + ε) + β

where γ and β are learnable scale and shift parameters of dimension d_model, and ε = 1e-5 is a small constant for numerical stability.
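The formula above can be sketched directly in NumPy. This is a minimal illustrative implementation, not a library API; note that most frameworks place ε inside the square root (√(σ² + ε)) rather than adding it to σ, a numerically equivalent convention for small ε:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize x over its last (feature) axis, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)   # per-token mean over d_model
    var = x.var(axis=-1, keepdims=True)   # per-token variance over d_model
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

d_model = 512
x = np.random.randn(4, 10, d_model)      # (batch, seq_len, d_model)
gamma = np.ones(d_model)                 # learnable scale, initialized to 1
beta = np.zeros(d_model)                 # learnable shift, initialized to 0
y = layer_norm(x, gamma, beta)
```

With γ=1 and β=0, each token's output vector has zero mean and unit standard deviation over its d_model features.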

Comparison of Normalization Methods

| Method | Normalizes Over | Batch Size Dep. | Variable Length | Common Use |
| --- | --- | --- | --- | --- |
| Batch Norm | Batch dimension per feature | Yes | Problematic | CNNs, vision |
| Layer Norm | Feature dimension per sample | No | Yes | Transformers, RNNs |
| Instance Norm | Spatial dimensions per sample per channel | No | Yes | Style transfer |
| Group Norm | Groups of channels per sample | No | Yes | Vision with small batches |
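The key distinction between the first two rows is just the axis over which statistics are computed. A short sketch (shapes are illustrative, not from any particular model):

```python
import numpy as np

x = np.random.randn(8, 16, 64)  # (batch, seq_len, features)

# Layer norm: one (mean, var) pair per token, computed over the feature axis.
ln_mu = x.mean(axis=-1, keepdims=True)      # shape (8, 16, 1)

# Batch norm: one (mean, var) pair per feature, computed over batch (and
# sequence) positions -- this is what couples examples together.
bn_mu = x.mean(axis=(0, 1), keepdims=True)  # shape (1, 1, 64)
```

Because layer norm's statistics never cross example boundaries, a batch of one behaves identically to a batch of a thousand.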

Pre-Norm vs Post-Norm Placement

| Configuration | Formula | Behavior |
| --- | --- | --- |
| Post-norm (original, 2017) | LayerNorm(x + Sublayer(x)) | Requires LR warmup; can diverge without it |
| Pre-norm (modern) | x + Sublayer(LayerNorm(x)) | Stable without warmup; better gradient flow |

Xiong et al. (2020) showed that at initialization the gradient norm in post-norm transformers is concentrated in the layers near the output, causing instability. Pre-norm distributes gradients more evenly across depth, making warmup unnecessary. Large pre-trained models almost universally use pre-norm.
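The two placements from the table can be contrasted in a few lines. This is a structural sketch only: the `ffn` stand-in is an arbitrary elementwise function, not a real transformer sublayer:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def post_norm_block(x, sublayer, gamma, beta):
    # Original 2017 placement: normalize AFTER the residual addition,
    # so every layer's output passes through a LayerNorm.
    return layer_norm(x + sublayer(x), gamma, beta)

def pre_norm_block(x, sublayer, gamma, beta):
    # Modern placement: normalize the sublayer INPUT; the residual path
    # itself is an identity, which is what improves gradient flow.
    return x + sublayer(layer_norm(x, gamma, beta))

d_model = 8
x = np.random.randn(2, 4, d_model)
gamma, beta = np.ones(d_model), np.zeros(d_model)
ffn = lambda h: np.maximum(h, 0.0)  # placeholder sublayer for illustration
out_pre = pre_norm_block(x, ffn, gamma, beta)
out_post = post_norm_block(x, ffn, gamma, beta)
```

Note the asymmetry: if the sublayer outputs zero, a pre-norm block passes x through unchanged, while a post-norm block still renormalizes it.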

Parameter Budget

Layer normalization adds minimal parameters:

| Model Size | d_model | LN instances | LN parameters | % of total |
| --- | --- | --- | --- | --- |
| Base transformer | 512 | 30 | 30,720 | 0.047% |
| Large transformer | 1024 | 30 | 61,440 | 0.029% |

The computational cost of layer norm is also small relative to attention and FFN operations — roughly O(n·d) additions and multiplications per layer for sequence length n.
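The table's counts follow from simple arithmetic (assuming the original 6-encoder/6-decoder layout and a ~65M-parameter base model):

```python
# LayerNorm parameter budget for the base encoder-decoder transformer.
d_model = 512
encoder_layers = decoder_layers = 6
ln_instances = encoder_layers * 2 + decoder_layers * 3  # one LN per sublayer
ln_params = ln_instances * 2 * d_model                  # gamma and beta each d_model
total_params = 65_000_000                               # approximate base model size
share = ln_params / total_params                        # fraction of all parameters
```

This confirms the "less than 0.05%" figure: layer norm is essentially free in parameter terms.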

See residual-connections for how layer norm interacts with skip connections, and transformer-architecture for the full stack of operations in each encoder/decoder layer.


Frequently Asked Questions

Why use layer normalization instead of batch normalization in transformers?

Batch normalization computes statistics across the batch dimension, which is problematic for variable-length sequences and small batch sizes. Layer normalization normalizes across the feature dimension for each example independently, so it is batch-size-agnostic and behaves identically at training and inference time (no running statistics are needed). Ba et al. (2016) showed layer norm is particularly well suited to recurrent and attention-based architectures.

What is pre-norm vs post-norm and which is better?

Post-norm (original transformer): LayerNorm(x + Sublayer(x)). Pre-norm: x + Sublayer(LayerNorm(x)). Xiong et al. (2020) showed that post-norm transformers require careful learning rate warmup to avoid divergence, while pre-norm transformers converge more reliably without warmup. Most modern large models use pre-norm (also called 'pre-layer normalization').

Does layer normalization add many parameters?

Layer normalization adds 2×d_model parameters per instance (γ and β, one value per feature dimension). For the base transformer with d_model=512 and 30 normalization operations (2 per encoder layer × 6 + 3 per decoder layer × 6 = 30), that is 30 × 2 × 512 = 30,720 parameters — less than 0.05% of the 65M total parameter count.
