Backpropagation: Chain Rule, Computational Graphs, and Automatic Differentiation
Backpropagation computes ∂L/∂W via the chain rule (∂L/∂W = ∂L/∂output · ∂output/∂W); a single backward pass yields all N parameter gradients in O(N) operations, the same asymptotic cost as the forward pass (Rumelhart et al., 1986).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Chain rule for composed functions | ∂L/∂x = ∂L/∂y · ∂y/∂x | — | Applied recursively through each layer; generalizes to multivariable Jacobians |
| Backward pass cost | ~2–3× forward pass | relative FLOPs | Two matrix multiplies per layer vs one for forward, plus activation gradient computations |
| Memory for activations | O(n · d · L) | proportional | n=sequence length, d=model dim, L=layers; all activations must be stored for backward |
| Gradient checkpointing memory savings | O(√L) | recomputation | Chen et al. (2016): recompute activations during backward; trades memory for compute |
| Automatic differentiation modes | Forward vs reverse mode | — | Reverse mode (backprop) computes all N gradients in one pass; optimal when inputs (parameters) vastly outnumber outputs |
Backpropagation is the algorithm that makes training deep neural networks feasible. By efficiently applying the chain rule through a computational graph, it computes the gradient of the loss with respect to every parameter in a single backward pass — the same asymptotic cost as one forward pass.
The Chain Rule
For composed functions h(x) = g(f(x)):
∂h/∂x = (∂g/∂f) · (∂f/∂x)
In a neural network, the loss L is a composition of many functions: L = loss(layer_L(layer_{L-1}(…layer_1(x)…))). The chain rule applied recursively gives:
∂L/∂W_l = ∂L/∂a_L · ∂a_L/∂a_{L-1} · … · ∂a_{l+1}/∂a_l · ∂a_l/∂W_l
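The recursive chain rule above can be made concrete with a hand-written backward pass. The sketch below (a minimal two-layer ReLU network with MSE loss; shapes and variable names are illustrative, not from the text) computes each layer's gradient by multiplying the upstream gradient by the local derivative, then checks one entry against a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 2-layer network with MSE loss: L = ||relu(x W1) W2 - t||^2 / 2.
x = rng.normal(size=(4, 3))   # batch of 4, input dim 3 (illustrative sizes)
W1 = rng.normal(size=(3, 5))
W2 = rng.normal(size=(5, 2))
t = rng.normal(size=(4, 2))

# Forward pass, keeping every intermediate activation.
z1 = x @ W1
a1 = np.maximum(0.0, z1)
y = a1 @ W2
loss = 0.5 * np.sum((y - t) ** 2)

# Backward pass: chain rule applied layer by layer, from the loss to W1.
dy = y - t                      # dL/dy
dW2 = a1.T @ dy                 # dL/dW2 = a1^T · dL/dy
da1 = dy @ W2.T                 # dL/da1 = dL/dy · W2^T
dz1 = da1 * (z1 > 0)            # ReLU gradient: pass-through where z1 > 0
dW1 = x.T @ dz1                 # dL/dW1 = x^T · dL/dz1

# Check dW1 against a central finite difference on one entry.
def f(W):
    return 0.5 * np.sum((np.maximum(0.0, x @ W) @ W2 - t) ** 2)

eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
numeric = (f(W1p) - f(W1m)) / (2 * eps)
assert abs(numeric - dW1[0, 0]) < 1e-4
```

Note that each backward step reuses an activation stored during the forward pass (`a1`, `z1`, `x`), which is exactly why backpropagation's memory cost scales with depth.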
Computational Graph Representation
| Node Type | Forward Computation | Backward Gradient |
|---|---|---|
| Linear (xW) | y = xW | ∂L/∂x = ∂L/∂y · Wᵀ; ∂L/∂W = xᵀ · ∂L/∂y |
| ReLU | y = max(0, x) | ∂L/∂x = ∂L/∂y · 𝟙[x > 0] |
| Softmax | yᵢ = e^{xᵢ}/Σⱼe^{xⱼ} | ∂L/∂x = y ⊙ (∂L/∂y − Σⱼ (∂L/∂yⱼ)·yⱼ) |
| LayerNorm | y = (x-μ)/σ · γ + β | Complex; involves mean/variance gradients |
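The softmax backward rule in the table is the least obvious entry; it can be verified numerically. The sketch below (illustrative values; the upstream gradient is arbitrary) applies the rule ∂L/∂x = y ⊙ (∂L/∂y − Σⱼ (∂L/∂yⱼ)·yⱼ) and compares one component against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(1)

# y = softmax(x), computed stably by subtracting the max.
x = rng.normal(size=6)
y = np.exp(x - x.max())
y /= y.sum()

dLdy = rng.normal(size=6)                 # arbitrary upstream gradient
dLdx = y * (dLdy - np.dot(dLdy, y))       # the table's softmax backward rule

# Finite-difference check: choose L(x) = dLdy · softmax(x), so that the
# upstream gradient dL/dy equals dLdy exactly.
def loss_of(x_):
    s = np.exp(x_ - x_.max())
    s /= s.sum()
    return np.dot(dLdy, s)

eps = 1e-6
xp, xm = x.copy(), x.copy()
xp[2] += eps
xm[2] -= eps
numeric = (loss_of(xp) - loss_of(xm)) / (2 * eps)
assert abs(numeric - dLdx[2]) < 1e-5
```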
Forward vs Backward Memory
| Mode | Memory | FLOPs | Use Case |
|---|---|---|---|
| Forward only | O(1) per layer (streaming) | ≈N | Inference |
| Forward + backward | O(N) total activations | ≈3N (forward plus ~2× for backward) | Training |
| Gradient checkpointing | O(√L · n · d) | ≈3N + roughly one extra forward of recompute | Memory-limited training |
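The checkpointing row can be illustrated end to end. The sketch below (a chain of scalar-parameter tanh layers; layer count, segment size, and all names are illustrative, not a real library API) stores activations only at every k-th layer, recomputes each segment during the backward pass, and confirms the parameter gradients match a full-storage backward pass:

```python
import numpy as np

rng = np.random.default_rng(2)

# Chain of L layers f_l(a) = tanh(a * w_l); checkpoint every k layers.
L, k = 16, 4
w = rng.normal(size=L) * 0.5
x = rng.normal(size=8)

def forward_segment(a, lo, hi):
    """Run layers lo..hi-1, returning all intermediate activations."""
    acts = [a]
    for l in range(lo, hi):
        acts.append(np.tanh(acts[-1] * w[l]))
    return acts

# Forward with checkpoints: keep only activations at layers 0, k, 2k, ...
checkpoints = {0: x}
a = x
for l in range(L):
    a = np.tanh(a * w[l])
    if (l + 1) % k == 0:
        checkpoints[l + 1] = a
loss = 0.5 * np.sum(a ** 2)

# Backward: walk segments in reverse, recomputing activations per segment.
grad = a.copy()                      # dL/da_L for L = ||a_L||^2 / 2
dw = np.zeros(L)
for hi in range(L, 0, -k):
    lo = hi - k
    acts = forward_segment(checkpoints[lo], lo, hi)   # recompute segment
    for l in range(hi - 1, lo - 1, -1):
        pre = acts[l - lo] * w[l]
        g = grad * (1.0 - np.tanh(pre) ** 2)          # through tanh
        dw[l] = np.sum(g * acts[l - lo])              # dL/dw_l
        grad = g * w[l]                               # dL/da_l

# Reference: full-storage backward pass gives the same gradients.
acts_full = forward_segment(x, 0, L)
grad_ref = acts_full[-1].copy()
dw_ref = np.zeros(L)
for l in range(L - 1, -1, -1):
    pre = acts_full[l] * w[l]
    g = grad_ref * (1.0 - np.tanh(pre) ** 2)
    dw_ref[l] = np.sum(g * acts_full[l])
    grad_ref = g * w[l]
assert np.allclose(dw, dw_ref)
```

With k ≈ √L, only O(√L) checkpoints are held at once, at the cost of roughly one extra forward pass of recomputation, matching the table's memory/FLOPs trade-off.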
Gradient Accumulation
When batch sizes are limited by GPU memory, gradient accumulation simulates larger batches:
- Run mini-batch forward + backward, accumulate gradients (no optimizer step)
- Repeat K times
- After K accumulations, divide by K and take optimizer step
This is equivalent to training with K× larger effective batch size, which is important for matching training dynamics in scaling experiments.
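For a loss averaged within each mini-batch, the equivalence can be checked directly. The sketch below (linear model with MSE; batch sizes and names are illustrative) accumulates gradients over K micro-batches without stepping, divides by K, and compares against the gradient of the full batch:

```python
import numpy as np

rng = np.random.default_rng(3)

K, micro, d = 4, 8, 5                 # K micro-batches of size `micro`
X = rng.normal(size=(K * micro, d))
t = rng.normal(size=K * micro)
w = rng.normal(size=d)

def grad_mse(Xb, tb, w):
    """Gradient of the mean squared error 0.5*mean((Xb w - tb)^2) w.r.t. w."""
    r = Xb @ w - tb
    return Xb.T @ r / len(tb)

# Accumulate per-micro-batch gradients; no optimizer step inside the loop.
acc = np.zeros(d)
for i in range(K):
    Xb = X[i * micro:(i + 1) * micro]
    tb = t[i * micro:(i + 1) * micro]
    acc += grad_mse(Xb, tb, w)
acc /= K                              # average over the K accumulations

full = grad_mse(X, t, w)              # gradient of the full batch at once
assert np.allclose(acc, full)
```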
Related Pages
See gradient-descent for how computed gradients are used in parameter updates, compute-flops for the total FLOPs accounting including the backward pass, and neural-network-fundamentals for the mathematical foundations.
Sources
- Rumelhart, Hinton & Williams (1986) — Learning representations by back-propagating errors. Nature
- Baydin et al. (2018) — Automatic Differentiation in Machine Learning: a Survey. JMLR
- Goodfellow et al. (2016) — Deep Learning. MIT Press (Chapter 6)
Frequently Asked Questions
Why is reverse-mode automatic differentiation (backprop) more efficient than forward mode for neural networks?
Forward-mode AD computes derivatives of all outputs with respect to one input — efficient when there are few inputs and many outputs. Reverse-mode AD computes derivatives of one scalar output (the loss L) with respect to all inputs — efficient when there are many inputs (all model parameters) and one output. Neural networks have millions of parameters but only one loss scalar, making reverse-mode O(N) vs forward-mode O(N²) for computing all N gradients.
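The "one direction per pass" nature of forward mode can be shown with a hand-rolled tangent propagation (dual numbers). The sketch below (a toy scalar loss; function and names are illustrative) needs one forward-mode pass per basis direction to recover the full gradient, which is exactly why reverse mode wins when parameters outnumber outputs:

```python
import numpy as np

rng = np.random.default_rng(4)

w = rng.normal(size=3)                 # 3 "parameters" (illustrative)
x = rng.normal(size=3)

def loss(w):
    return np.sum(np.tanh(w * x)) ** 2

def forward_mode(w, direction):
    """Propagate (value, tangent) pairs through the computation."""
    v, dv = w * x, direction * x                      # primal and tangent
    v, dv = np.tanh(v), (1 - np.tanh(w * x) ** 2) * dv
    s, ds = np.sum(v), np.sum(dv)
    return s ** 2, 2 * s * ds                         # loss and its tangent

# N forward passes, one per basis direction, to recover the full gradient.
grad_fwd = np.array([forward_mode(w, e)[1] for e in np.eye(3)])

# Finite-difference check of all three components.
eps = 1e-6
grad_fd = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(grad_fwd, grad_fd, atol=1e-5)
```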
What is the memory cost of backpropagation?
Backpropagation requires storing all intermediate activations from the forward pass — these are needed to compute gradients during the backward pass. For a transformer with L layers, sequence length n, and model dimension d, activation memory is O(n·d·L). For large models at long sequence lengths, this can exceed weight memory. Gradient checkpointing (Chen et al., 2016) trades compute for memory: checkpoint activations only every ~√L layers and recompute the rest during the backward pass, reducing activation memory to O(√L · n · d) at the cost of roughly one extra forward pass.
How does backpropagation through the attention mechanism work?
The attention computation Attention(Q,K,V) = softmax(Q·Kᵀ/√d_k)·V is differentiable. During backward, gradients flow through the value weighting (splitting between the attention weights and V), through the softmax (via its Jacobian), and through the matrix multiply Q·Kᵀ (splitting between Q and K). PyTorch and JAX derive these gradients automatically through their autodiff engines; fused kernels such as FlashAttention implement the backward pass by recomputing attention blocks on the fly, avoiding storage of the full n×n attention matrix for memory efficiency.
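The gradient flow described above can be written out by hand for the query matrix. The sketch below (tiny shapes, arbitrary upstream gradient; all sizes are illustrative) chains through the V-weighting, the row-wise softmax, and the scaled Q·Kᵀ product, then checks one entry of ∂L/∂Q against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(5)

n, d_k = 4, 3                        # illustrative sequence length and head dim
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))
G = rng.normal(size=(n, d_k))        # arbitrary upstream gradient dL/dOut

def attn(Q):
    S = Q @ K.T / np.sqrt(d_k)
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

def scalar_loss(Q):
    return np.sum(G * attn(Q))       # L = <G, Attention(Q, K, V)>

# Analytic gradient via the chain rule.
S = Q @ K.T / np.sqrt(d_k)
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)
dA = G @ V.T                                            # through A @ V
dS = A * (dA - np.sum(dA * A, axis=1, keepdims=True))   # row-wise softmax
dQ = dS @ K / np.sqrt(d_k)                              # through Q K^T / sqrt(d_k)

# Finite-difference check on one entry of Q.
eps = 1e-6
Qp, Qm = Q.copy(), Q.copy()
Qp[1, 2] += eps
Qm[1, 2] -= eps
numeric = (scalar_loss(Qp) - scalar_loss(Qm)) / (2 * eps)
assert abs(numeric - dQ[1, 2]) < 1e-5
```

Note that the analytic path needs the attention matrix A; FlashAttention avoids materializing it by recomputing blocks of A during the backward pass.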