Backpropagation: Chain Rule, Computational Graphs, and Automatic Differentiation

Category: training Updated: 2026-02-27

Backpropagation computes ∂L/∂W via the chain rule: ∂L/∂W = ∂L/∂output · ∂output/∂W; a single backward pass computes all N parameter gradients in O(N) operations — same asymptotic cost as the forward pass (Rumelhart et al., 1986).

Key Data Points
| Measure | Value | Unit | Notes |
| --- | --- | --- | --- |
| Chain rule for composed functions | ∂L/∂x = ∂L/∂y · ∂y/∂x | | Applied recursively through each layer; generalizes to multivariable Jacobians |
| Backward pass cost | ~2–3× forward pass | relative FLOPs | Two matrix multiplies per layer vs. one for forward, plus activation-gradient computations |
| Memory for activations | O(n · d · L) | proportional | n = sequence length, d = model dim, L = layers; all activations must be stored for the backward pass |
| Gradient checkpointing memory savings | O(√L) | recomputation | Chen et al. (2016): recompute activations during backward; trades memory for compute |
| Automatic differentiation modes | forward vs. reverse | | Reverse mode (backprop) computes all N gradients in one pass; optimal for one scalar output and N ≫ 1 inputs |

Backpropagation is the algorithm that makes training deep neural networks feasible. By efficiently applying the chain rule through a computational graph, it computes the gradient of the loss with respect to every parameter in a single backward pass — the same asymptotic cost as one forward pass.

The Chain Rule

For composed functions h(x) = g(f(x)):

∂h/∂x = (∂g/∂f) · (∂f/∂x)

In a neural network, the loss L is a composition of many functions: L = loss(layer_L(layer_{L-1}(…layer_1(x)…))). The chain rule applied recursively gives:

∂L/∂W_l = ∂L/∂a_L · ∂a_L/∂a_{L-1} · … · ∂a_{l+1}/∂a_l · ∂a_l/∂W_l
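To make the recursion concrete, here is a sketch of the chain rule applied by hand to a hypothetical two-layer toy network (all names and shapes are illustrative), with one gradient entry checked against a finite difference:

```python
import numpy as np

# Toy network: a1 = x @ W1, a2 = relu(a1), L = sum(a2 @ W2).
# Backprop applies the chain rule layer by layer, right to left.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3))
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(4, 1))

a1 = x @ W1
a2 = np.maximum(0, a1)
L = (a2 @ W2).sum()

# Backward pass: dL/da2 -> dL/da1 -> dL/dW1
dL_da2 = np.ones((2, 1)) @ W2.T   # ∂L/∂a2: upstream gradient through W2
dL_da1 = dL_da2 * (a1 > 0)        # ReLU gradient: pass only where a1 > 0
dL_dW1 = x.T @ dL_da1             # ∂L/∂W1

# Sanity check: perturb one weight and compare to a finite difference
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
Lp = (np.maximum(0, x @ W1p) @ W2).sum()
numeric = (Lp - L) / eps
assert abs(numeric - dL_dW1[0, 0]) < 1e-4
```

The same pattern extends to any depth: each layer consumes the upstream gradient and emits gradients for its inputs and its own parameters.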

Computational Graph Representation

| Node Type | Forward Computation | Backward Gradient |
| --- | --- | --- |
| Linear (xW) | y = xW | ∂L/∂x = ∂L/∂y · Wᵀ; ∂L/∂W = xᵀ · ∂L/∂y |
| ReLU | y = max(0, x) | ∂L/∂x = ∂L/∂y ⊙ 𝟙[x > 0] |
| Softmax | yᵢ = e^{xᵢ} / Σⱼ e^{xⱼ} | ∂L/∂x = y ⊙ (∂L/∂y − Σⱼ ∂L/∂yⱼ · yⱼ) |
| LayerNorm | y = (x − μ)/σ · γ + β | Involves gradients through μ and σ; more involved than the rows above |
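The first three rows translate directly into backward functions. A minimal NumPy sketch (function names are illustrative, not from any framework), each taking the upstream gradient ∂L/∂y:

```python
import numpy as np

def linear_backward(dL_dy, x, W):
    # y = x @ W  →  ∂L/∂x = ∂L/∂y · Wᵀ,  ∂L/∂W = xᵀ · ∂L/∂y
    return dL_dy @ W.T, x.T @ dL_dy

def relu_backward(dL_dy, x):
    # y = max(0, x)  →  gradient passes only where x > 0
    return dL_dy * (x > 0)

def softmax_backward(dL_dy, y):
    # y = softmax(x)  →  ∂L/∂x = y ⊙ (∂L/∂y − Σⱼ ∂L/∂yⱼ · yⱼ), row-wise
    return y * (dL_dy - (dL_dy * y).sum(axis=-1, keepdims=True))
```

Chaining these three functions in reverse layer order reproduces the backward pass for a softmax-output MLP.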

Forward vs Backward Memory

| Mode | Memory | FLOPs | Use Case |
| --- | --- | --- | --- |
| Forward only | O(1) per layer (streaming) | O(N) | Inference |
| Forward + backward | O(N) total activations | O(3N) | Training |
| Gradient checkpointing | O(√L · n · d) | O(N) + recomputation | Memory-limited training |
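The checkpointing row can be sketched in NumPy for a hypothetical stack of identical ReLU-linear layers (names are my own; this assumes the layer count L is a multiple of the checkpoint interval k, with k ≈ √L giving the O(√L) memory figure):

```python
import numpy as np

def forward_with_checkpoints(x, Ws, k):
    # Store only the input and every k-th activation; drop the rest.
    ckpts = {0: x}
    a = x
    for l, W in enumerate(Ws):
        a = np.maximum(0, a @ W)
        if (l + 1) % k == 0:
            ckpts[l + 1] = a
    return a, ckpts

def backward_with_recompute(dL_da, Ws, ckpts, k):
    # Walk backward segment by segment, recomputing dropped activations
    # from the nearest checkpoint before backpropagating through them.
    grads = [None] * len(Ws)
    for seg_end in range(len(Ws), 0, -k):
        seg_start = seg_end - k
        acts = [ckpts[seg_start]]
        for W in Ws[seg_start:seg_end]:          # recompute (extra FLOPs)
            acts.append(np.maximum(0, acts[-1] @ W))
        for l in range(seg_end - 1, seg_start - 1, -1):
            pre = acts[l - seg_start] @ Ws[l]    # pre-activation of layer l
            d_pre = dL_da * (pre > 0)            # ReLU gradient
            grads[l] = acts[l - seg_start].T @ d_pre
            dL_da = d_pre @ Ws[l].T
    return grads
```

Frameworks expose the same trade (e.g. `torch.utils.checkpoint` in PyTorch) without hand-written recomputation; the sketch only shows where the extra forward FLOPs come from.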

Gradient Accumulation

When batch sizes are limited by GPU memory, gradient accumulation simulates larger batches:

  1. Run mini-batch forward + backward, accumulate gradients (no optimizer step)
  2. Repeat K times
  3. After K accumulations, divide by K and take optimizer step

This is equivalent to training with a K× larger effective batch size (exactly so when micro-batches are equal in size and the loss averages over examples), which matters for matching training dynamics in scaling experiments.
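The three steps above can be sketched for a hypothetical linear least-squares model (the `grad_fn` and data are illustrative stand-ins), with the equivalence to one large-batch step asserted directly:

```python
import numpy as np

def grad_fn(W, x, y):
    # L = mean ||x @ W − y||²  →  ∂L/∂W = 2 xᵀ(xW − y) / batch_size
    return 2 * x.T @ (x @ W - y) / len(x)

rng = np.random.default_rng(0)
W = np.zeros((3, 1))
K, lr = 4, 0.1
micro = [(rng.normal(size=(8, 3)), rng.normal(size=(8, 1))) for _ in range(K)]

accum = np.zeros_like(W)
for x, y in micro:            # steps 1–2: forward + backward, accumulate
    accum += grad_fn(W, x, y)
W -= lr * accum / K           # step 3: average over K, then optimizer step

# Same result as one step on the concatenated K×8 batch
big_x = np.concatenate([x for x, _ in micro])
big_y = np.concatenate([y for _, y in micro])
W_big = np.zeros((3, 1)) - lr * grad_fn(np.zeros((3, 1)), big_x, big_y)
assert np.allclose(W, W_big)
```

Note the equivalence breaks for layers that compute batch statistics (e.g. BatchNorm), which see K small batches rather than one large one.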

See gradient-descent for how computed gradients are used in parameter updates, compute-flops for the total FLOPs accounting including the backward pass, and neural-network-fundamentals for the mathematical foundations.


Frequently Asked Questions

Why is reverse-mode automatic differentiation (backprop) more efficient than forward mode for neural networks?

Forward-mode AD computes derivatives of all outputs with respect to one input — efficient when there are few inputs and many outputs. Reverse-mode AD computes derivatives of one scalar output (the loss L) with respect to all inputs — efficient when there are many inputs (all model parameters) and one output. Neural networks have millions of parameters but only one loss scalar, making reverse-mode O(N) vs forward-mode O(N²) for computing all N gradients.
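A small sketch of the asymmetry, using dual-number-style forward mode on a hypothetical scalar loss f(w) = (w·x − y)²: each forward-mode pass carries one tangent direction, so all N gradient components need N passes, while reverse mode yields the full gradient at once.

```python
import numpy as np

def loss_with_tangent(w, dw, x, y):
    # Propagate (value, derivative) pairs forward through f(w) = (w·x − y)²
    v, dv = w @ x, dw @ x
    r, dr = v - y, dv
    return r * r, 2 * r * dr

w = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, 0.1, 2.0])
y = 1.0

# Forward mode: one pass per basis direction e_i → one component each
grad_fwd = np.array([loss_with_tangent(w, e, x, y)[1] for e in np.eye(3)])

# Reverse mode: a single pass gives the entire gradient of the scalar loss
r = w @ x - y
grad_rev = 2 * r * x
assert np.allclose(grad_fwd, grad_rev)
```

With N in the millions, the N forward passes become the O(N²) cost the answer above refers to.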

What is the memory cost of backpropagation?

Backpropagation requires storing all intermediate activations from the forward pass — these are needed to compute gradients during the backward pass. For a transformer with L layers, sequence length n, and model dimension d, activation memory is O(n·d·L). For large models at long sequence lengths, this can exceed weight memory. Gradient checkpointing (Chen et al., 2016) trades compute for memory: checkpoint activations only every √L layers and recompute the rest during the backward pass, reducing memory to O(√L) stored activations.

How does backpropagation through the attention mechanism work?

The attention computation Attention(Q,K,V) = softmax(Q·Kᵀ/√d_k)·V is differentiable end to end. During backward, gradients flow through the value weighting (split between the attention weights and V), through the softmax via its Jacobian, and through the matrix product Q·Kᵀ (split between Q and K). PyTorch and JAX derive these gradients automatically; fused kernels such as FlashAttention additionally avoid storing the full n×n attention matrix by recomputing it during the backward pass.
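A dense single-head sketch of that backward path (illustrative function names; no fused-kernel tricks), reusing the row-wise softmax Jacobian from the table above:

```python
import numpy as np

def attention(Q, K, V):
    # Forward: A = softmax(Q·Kᵀ/√d) · V, with the usual max-shift for stability
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)
    S = S - S.max(axis=-1, keepdims=True)
    P = np.exp(S)
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V, P

def attention_backward(dOut, Q, K, V, P):
    d = Q.shape[-1]
    dV = P.T @ dOut                     # through the value weighting
    dP = dOut @ V.T                     # gradient into attention weights
    dS = P * (dP - (dP * P).sum(axis=-1, keepdims=True))  # softmax Jacobian
    dQ = dS @ K / np.sqrt(d)            # through Q·Kᵀ/√d, split to Q …
    dK = dS.T @ Q / np.sqrt(d)          # … and to K
    return dQ, dK, dV
```

The sketch stores the full attention matrix P; FlashAttention's contribution is precisely avoiding that storage by recomputing P tile by tile in the backward pass.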
