Layer Normalization: Formula, Pre-Norm vs Post-Norm, and Training Stability
Layer normalization normalizes across d_model features per token: y = γ·(x−μ)/σ + β; applied before each sublayer in pre-norm transformers; enables stable training of 100+ layer networks (Ba et al., 2016).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Normalization formula | y = γ·(x−μ)/σ + β | — | μ = mean over d_model features; σ = std dev; γ, β learnable per dimension |
| Normalization axis | d_model features | — | Statistics computed within each token independently, not across batch or sequence |
| Learnable parameters per layer | 2 × d_model | parameters | γ ∈ ℝ^{d_model} and β ∈ ℝ^{d_model}; initialized γ=1, β=0 |
| Pre-norm training speed advantage | ~2× | relative convergence | Xiong et al. (2020): pre-norm converges faster and is more stable than post-norm |
| LayerNorm parameters, base transformer | 30 × 2 × 512 = 30,720 | parameters | 2 LN per encoder layer × 6 + 3 LN per decoder layer × 6 = 30 instances, 2×d_model each |
Layer normalization, introduced by Ba et al. (2016), stabilizes the activations within each layer of a neural network by normalizing across the feature dimension. In transformers, it is applied after (post-norm) or before (pre-norm) each sublayer, enabling stable training of very deep networks.
The Formula
For an input vector x ∈ ℝ^{d_model}:
μ = (1/d_model) Σᵢ xᵢ

σ² = (1/d_model) Σᵢ (xᵢ − μ)²

y = γ ⊙ (x − μ) / √(σ² + ε) + β

where γ and β are learnable scale and shift parameters of dimension d_model, and ε = 1e-5 is a small constant for numerical stability.
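A minimal NumPy sketch of this formula (the function name and shapes here are illustrative, not from any particular library):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization over the last (feature) axis.

    x: (..., d_model) activations; gamma, beta: (d_model,) learnable params.
    """
    mu = x.mean(axis=-1, keepdims=True)    # per-token mean over d_model
    var = x.var(axis=-1, keepdims=True)    # per-token variance over d_model
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

d_model = 8
x = np.random.randn(2, 4, d_model)         # (batch, seq_len, d_model)
y = layer_norm(x, np.ones(d_model), np.zeros(d_model))

# With gamma=1, beta=0, each token's features have ~zero mean, unit variance.
print(np.allclose(y.mean(axis=-1), 0, atol=1e-6))  # True
```

Note that the statistics are computed per token: the batch and sequence dimensions never enter the mean or variance, which is exactly what makes layer norm batch-size-agnostic.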
Comparison of Normalization Methods
| Method | Normalizes Over | Batch Size Dep. | Variable Length | Common Use |
|---|---|---|---|---|
| Batch Norm | Batch dimension per feature | Yes | Problematic | CNNs, vision |
| Layer Norm | Feature dimension per sample | No | Yes | Transformers, RNNs |
| Instance Norm | Feature dimension per sample per channel | No | Yes | Style transfer |
| Group Norm | Groups of channels per sample | No | Yes | Vision with small batch |
Pre-Norm vs Post-Norm Placement
| Configuration | Formula | Behavior |
|---|---|---|
| Post-norm (original, 2017) | LayerNorm(x + Sublayer(x)) | Requires LR warmup; can diverge without it |
| Pre-norm (modern) | x + Sublayer(LayerNorm(x)) | Stable without warmup; better gradient flow |
Xiong et al. (2020) proved that the gradient norm in post-norm transformers is dominated by the last layer at initialization, causing instability. Pre-norm distributes gradients more evenly, making warm-up unnecessary. Large pre-trained models almost universally use pre-norm.
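The two placements from the table above can be sketched as residual blocks; the sublayer here is a stand-in linear map, not a real attention or FFN implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # gamma=1, beta=0 omitted for brevity
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Original 2017 placement: normalize the residual sum.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Modern placement: normalize the sublayer input; the residual
    # path itself carries the raw signal, improving gradient flow.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1
sublayer = lambda h: h @ W        # toy stand-in for attention/FFN

x = rng.normal(size=(4, 8))
out_post = post_norm_block(x, sublayer)  # output is normalized per token
out_pre = pre_norm_block(x, sublayer)    # residual stream left un-normalized
```

The structural difference is visible in the outputs: every post-norm block emits normalized activations, while pre-norm leaves an identity path from input to output, which is the property Xiong et al. (2020) connect to more even gradient distribution.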
Parameter Budget
Layer normalization adds minimal parameters:
| Model Size | d_model | LN instances | LN parameters | % of total |
|---|---|---|---|---|
| Base transformer | 512 | 30 | 30,720 | 0.047% |
| Large transformer | 1024 | 30 | 61,440 | 0.029% |
The computational cost of layer norm is also small relative to attention and FFN operations — roughly O(n·d) additions and multiplications per layer for sequence length n.
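The base-transformer row of the table reduces to a few lines of arithmetic:

```python
# LayerNorm parameter budget for the base transformer (d_model=512):
# 2 LN per encoder layer x 6 layers + 3 LN per decoder layer x 6 layers.
d_model = 512
ln_instances = 2 * 6 + 3 * 6             # 30 normalization instances
ln_params = ln_instances * 2 * d_model   # gamma + beta per instance

print(ln_params)                          # 30720
print(f"{ln_params / 65_000_000:.3%}")   # ~0.047% of the 65M-parameter model
```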
Related Pages
See residual-connections for how layer norm interacts with skip connections, and transformer-architecture for the full stack of operations in each encoder/decoder layer.
Sources
- Ba et al. (2016) — Layer Normalization. arXiv 2016
- Xiong et al. (2020) — On Layer Normalization in the Transformer Architecture. ICML 2020
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
Frequently Asked Questions
Why use layer normalization instead of batch normalization in transformers?
Batch normalization computes statistics across the batch dimension, which is problematic for variable-length sequences and small batch sizes. Layer normalization normalizes across the feature dimension for each example independently, making it batch-size-agnostic and equally effective during inference. Ba et al. (2016) showed layer norm particularly suits recurrent and attention-based architectures.
What is pre-norm vs post-norm and which is better?
Post-norm (original transformer): LayerNorm(x + Sublayer(x)). Pre-norm: x + Sublayer(LayerNorm(x)). Xiong et al. (2020) showed that post-norm transformers require careful learning rate warmup to avoid divergence, while pre-norm transformers converge more reliably without warmup. Most modern large models use pre-norm (also called 'pre-layer normalization').
Does layer normalization add many parameters?
Layer normalization adds 2×d_model parameters per normalization instance (γ and β, one per feature dimension). For the base transformer with d_model=512 and 30 normalization instances (2 per encoder layer × 6 + 3 per decoder layer × 6 = 30), that is 30 × 2 × 512 = 30,720 parameters — less than 0.05% of the 65M total parameter count.