Layer Normalization: Formula, Pre-Norm vs Post-Norm, and Training Stability
Layer normalization normalizes across d_model features per token: y = γ·(x−μ)/σ + β; applied before each sublayer in pre-norm transformers; enables stable training of 100+ layer networks (Ba et al., 2016).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Normalization formula | y = γ·(x−μ)/σ + β | — | μ = mean over d_model features; σ = std dev; γ, β learnable per dimension |
| Normalization axis | d_model features | — | Statistics computed within each token independently, not across batch or sequence |
| Learnable parameters per layer | 2 × d_model | parameters | γ ∈ ℝ^{d_model} and β ∈ ℝ^{d_model}; initialized γ=1, β=0 |
| Pre-norm training speed advantage | ~2× | relative convergence | Xiong et al. (2020): pre-norm converges faster and is more stable than post-norm |
| LayerNorm parameters, base transformer | 30 × 2 × 512 = 30,720 | parameters | 2 LN per encoder layer × 6 + 3 LN per decoder layer × 6 = 30 instances, 2×d_model each |
Layer normalization, introduced by Ba et al. (2016), stabilizes the activations within each layer of a neural network by normalizing across the feature dimension. In transformers, it is applied after (post-norm) or before (pre-norm) each sublayer, enabling stable training of very deep networks.
The Formula
For an input vector x ∈ ℝ^{d_model}:
μ = (1/d_model) Σᵢ xᵢ

σ² = (1/d_model) Σᵢ (xᵢ − μ)²

y = γ ⊙ (x − μ) / √(σ² + ε) + β

where γ and β are learnable scale and shift parameters of dimension d_model, and ε = 1e-5 is a small constant for numerical stability.
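A minimal NumPy sketch of this formula (the function name and shapes here are illustrative, not from any particular library):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization over the last (feature) axis.

    x: (..., d_model) activations; gamma, beta: (d_model,) learnable params.
    """
    mu = x.mean(axis=-1, keepdims=True)    # per-token mean over d_model
    var = x.var(axis=-1, keepdims=True)    # per-token variance over d_model
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

d_model = 8
x = np.random.randn(2, 4, d_model)         # (batch, seq_len, d_model)
y = layer_norm(x, np.ones(d_model), np.zeros(d_model))

# With gamma=1, beta=0, each token's features have ~zero mean, unit variance.
print(np.allclose(y.mean(axis=-1), 0, atol=1e-6))  # True
```

Note that the statistics are computed per token: the batch and sequence dimensions never enter the mean or variance, which is exactly what makes layer norm batch-size-agnostic.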
Comparison of Normalization Methods
| Method | Normalizes Over | Batch Size Dep. | Variable Length | Common Use |
|---|---|---|---|---|
| Batch Norm | Batch dimension per feature | Yes | Problematic | CNNs, vision |
| Layer Norm | Feature dimension per sample | No | Yes | Transformers, RNNs |
| Instance Norm | Feature dimension per sample per channel | No | Yes | Style transfer |
| Group Norm | Groups of channels per sample | No | Yes | Vision with small batch |
Pre-Norm vs Post-Norm Placement
| Configuration | Formula | Behavior |
|---|---|---|
| Post-norm (original, 2017) | LayerNorm(x + Sublayer(x)) | Requires LR warmup; can diverge without it |
| Pre-norm (modern) | x + Sublayer(LayerNorm(x)) | Stable without warmup; better gradient flow |
Xiong et al. (2020) proved that the gradient norm in post-norm transformers is dominated by the last layer at initialization, causing instability. Pre-norm distributes gradients more evenly, making warm-up unnecessary. Large pre-trained models almost universally use pre-norm.
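The two placements from the table above can be sketched as residual blocks; the sublayer here is a stand-in linear map, not a real attention or FFN implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # gamma=1, beta=0 omitted for brevity
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Original 2017 placement: normalize the residual sum.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Modern placement: normalize the sublayer input; the residual
    # path itself carries the raw signal, improving gradient flow.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1
sublayer = lambda h: h @ W        # toy stand-in for attention/FFN

x = rng.normal(size=(4, 8))
out_post = post_norm_block(x, sublayer)  # output is normalized per token
out_pre = pre_norm_block(x, sublayer)    # residual stream left un-normalized
```

The structural difference is visible in the outputs: every post-norm block emits normalized activations, while pre-norm leaves an identity path from input to output, which is the property Xiong et al. (2020) connect to more even gradient distribution.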
Parameter Budget
Layer normalization adds minimal parameters:
| Model Size | d_model | LN instances | LN parameters | % of total |
|---|---|---|---|---|
| Base transformer | 512 | 30 | 30,720 | 0.047% |
| Large transformer | 1024 | 30 | 61,440 | 0.029% |
The computational cost of layer norm is also small relative to attention and FFN operations — roughly O(n·d) additions and multiplications per layer for sequence length n.
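The base-transformer row of the table reduces to a few lines of arithmetic:

```python
# LayerNorm parameter budget for the base transformer (d_model=512):
# 2 LN per encoder layer x 6 layers + 3 LN per decoder layer x 6 layers.
d_model = 512
ln_instances = 2 * 6 + 3 * 6             # 30 normalization instances
ln_params = ln_instances * 2 * d_model   # gamma + beta per instance

print(ln_params)                          # 30720
print(f"{ln_params / 65_000_000:.3%}")   # ~0.047% of the 65M-parameter model
```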
Related Pages
See residual-connections for how layer norm interacts with skip connections, and transformer-architecture for the full stack of operations in each encoder/decoder layer.
Sources
- Ba et al. (2016) — Layer Normalization. arXiv 2016
- Xiong et al. (2020) — On Layer Normalization in the Transformer Architecture. ICML 2020
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
Frequently Asked Questions
Why use layer normalization instead of batch normalization in transformers?
Batch normalization computes statistics across the batch dimension, which is problematic for variable-length sequences and small batch sizes. Layer normalization normalizes across the feature dimension for each example independently, making it batch-size-agnostic and equally effective during inference. Ba et al. (2016) showed layer norm particularly suits recurrent and attention-based architectures.
What is pre-norm vs post-norm and which is better?
Post-norm (original transformer): LayerNorm(x + Sublayer(x)). Pre-norm: x + Sublayer(LayerNorm(x)). Xiong et al. (2020) showed that post-norm transformers require careful learning rate warmup to avoid divergence, while pre-norm transformers converge more reliably without warmup. Most modern large models use pre-norm (also called 'pre-layer normalization').
Does layer normalization add many parameters?
Layer normalization adds 2×d_model parameters per normalization instance (γ and β, one per feature dimension). For the base transformer with d_model=512 and 30 normalization instances (2 per encoder layer × 6 + 3 per decoder layer × 6 = 30), that is 30 × 2 × 512 = 30,720 parameters — less than 0.05% of the 65M total parameter count.