Backpropagation: Chain Rule, Computational Graphs, and Automatic Differentiation
Backpropagation computes ∂L/∂W via the chain rule (∂L/∂W = ∂L/∂output · ∂output/∂W); a single backward pass yields all N parameter gradients in O(N) operations, the same asymptotic cost as the forward pass (Rumelhart et al., 1986).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Chain rule for composed functions | ∂L/∂x = ∂L/∂y · ∂y/∂x | — | Applied recursively through each layer; generalizes to multivariable Jacobians |
| Backward pass cost | ~2–3× forward pass | relative FLOPs | Two matrix multiplies per layer vs one for forward, plus activation gradient computations |
| Memory for activations | O(n · d · L) | proportional | n=sequence length, d=model dim, L=layers; all activations must be stored for backward |
| Gradient checkpointing memory savings | O(√L) | recomputation | Chen et al. (2016): recompute activations during backward; trades memory for compute |
| Automatic differentiation modes | Forward vs reverse mode | — | Reverse mode (backprop) computes all N gradients in one pass; optimal when inputs (parameters) vastly outnumber outputs |
Backpropagation is the algorithm that makes training deep neural networks feasible. By efficiently applying the chain rule through a computational graph, it computes the gradient of the loss with respect to every parameter in a single backward pass — the same asymptotic cost as one forward pass.
The Chain Rule
For composed functions h(x) = g(f(x)):
∂h/∂x = (∂g/∂f) · (∂f/∂x)
In a neural network, the loss L is a composition of many functions: L = loss(layer_L(layer_{L-1}(…layer_1(x)…))). The chain rule applied recursively gives:
∂L/∂W_l = ∂L/∂a_L · ∂a_L/∂a_{L-1} · … · ∂a_{l+1}/∂a_l · ∂a_l/∂W_l
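The recursive chain rule above can be made concrete with a hand-written backward pass. The sketch below (a minimal two-layer ReLU network with MSE loss; shapes and variable names are illustrative, not from the text) computes each layer's gradient by multiplying the upstream gradient by the local derivative, then checks one entry against a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 2-layer network with MSE loss: L = ||relu(x W1) W2 - t||^2 / 2.
x = rng.normal(size=(4, 3))   # batch of 4, input dim 3 (illustrative sizes)
W1 = rng.normal(size=(3, 5))
W2 = rng.normal(size=(5, 2))
t = rng.normal(size=(4, 2))

# Forward pass, keeping every intermediate activation.
z1 = x @ W1
a1 = np.maximum(0.0, z1)
y = a1 @ W2
loss = 0.5 * np.sum((y - t) ** 2)

# Backward pass: chain rule applied layer by layer, from the loss to W1.
dy = y - t                      # dL/dy
dW2 = a1.T @ dy                 # dL/dW2 = a1^T · dL/dy
da1 = dy @ W2.T                 # dL/da1 = dL/dy · W2^T
dz1 = da1 * (z1 > 0)            # ReLU gradient: pass-through where z1 > 0
dW1 = x.T @ dz1                 # dL/dW1 = x^T · dL/dz1

# Check dW1 against a central finite difference on one entry.
def f(W):
    return 0.5 * np.sum((np.maximum(0.0, x @ W) @ W2 - t) ** 2)

eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
numeric = (f(W1p) - f(W1m)) / (2 * eps)
assert abs(numeric - dW1[0, 0]) < 1e-4
```

Note that each backward step reuses an activation stored during the forward pass (`a1`, `z1`, `x`), which is exactly why backpropagation's memory cost scales with depth.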
Computational Graph Representation
| Node Type | Forward Computation | Backward Gradient |
|---|---|---|
| Linear (xW) | y = xW | ∂L/∂x = ∂L/∂y · Wᵀ; ∂L/∂W = xᵀ · ∂L/∂y |
| ReLU | y = max(0, x) | ∂L/∂x = ∂L/∂y · 𝟙[x > 0] |
| Softmax | yᵢ = e^{xᵢ}/Σⱼe^{xⱼ} | ∂L/∂x = y ⊙ (∂L/∂y − Σⱼ (∂L/∂yⱼ)·yⱼ) |
| LayerNorm | y = (x-μ)/σ · γ + β | Complex; involves mean/variance gradients |
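The softmax backward rule in the table is the least obvious entry; it can be verified numerically. The sketch below (illustrative values; the upstream gradient is arbitrary) applies the rule ∂L/∂x = y ⊙ (∂L/∂y − Σⱼ (∂L/∂yⱼ)·yⱼ) and compares one component against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(1)

# y = softmax(x), computed stably by subtracting the max.
x = rng.normal(size=6)
y = np.exp(x - x.max())
y /= y.sum()

dLdy = rng.normal(size=6)                 # arbitrary upstream gradient
dLdx = y * (dLdy - np.dot(dLdy, y))       # the table's softmax backward rule

# Finite-difference check: choose L(x) = dLdy · softmax(x), so that the
# upstream gradient dL/dy equals dLdy exactly.
def loss_of(x_):
    s = np.exp(x_ - x_.max())
    s /= s.sum()
    return np.dot(dLdy, s)

eps = 1e-6
xp, xm = x.copy(), x.copy()
xp[2] += eps
xm[2] -= eps
numeric = (loss_of(xp) - loss_of(xm)) / (2 * eps)
assert abs(numeric - dLdx[2]) < 1e-5
```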
Forward vs Backward Memory
| Mode | Memory | FLOPs | Use Case |
|---|---|---|---|
| Forward only | O(1) per layer (streaming) | ≈N | Inference |
| Forward + backward | O(N) total activations | ≈3N (forward plus ~2× for backward) | Training |
| Gradient checkpointing | O(√L · n · d) | ≈3N + roughly one extra forward of recompute | Memory-limited training |
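The checkpointing row can be illustrated end to end. The sketch below (a chain of scalar-parameter tanh layers; layer count, segment size, and all names are illustrative, not a real library API) stores activations only at every k-th layer, recomputes each segment during the backward pass, and confirms the parameter gradients match a full-storage backward pass:

```python
import numpy as np

rng = np.random.default_rng(2)

# Chain of L layers f_l(a) = tanh(a * w_l); checkpoint every k layers.
L, k = 16, 4
w = rng.normal(size=L) * 0.5
x = rng.normal(size=8)

def forward_segment(a, lo, hi):
    """Run layers lo..hi-1, returning all intermediate activations."""
    acts = [a]
    for l in range(lo, hi):
        acts.append(np.tanh(acts[-1] * w[l]))
    return acts

# Forward with checkpoints: keep only activations at layers 0, k, 2k, ...
checkpoints = {0: x}
a = x
for l in range(L):
    a = np.tanh(a * w[l])
    if (l + 1) % k == 0:
        checkpoints[l + 1] = a
loss = 0.5 * np.sum(a ** 2)

# Backward: walk segments in reverse, recomputing activations per segment.
grad = a.copy()                      # dL/da_L for L = ||a_L||^2 / 2
dw = np.zeros(L)
for hi in range(L, 0, -k):
    lo = hi - k
    acts = forward_segment(checkpoints[lo], lo, hi)   # recompute segment
    for l in range(hi - 1, lo - 1, -1):
        pre = acts[l - lo] * w[l]
        g = grad * (1.0 - np.tanh(pre) ** 2)          # through tanh
        dw[l] = np.sum(g * acts[l - lo])              # dL/dw_l
        grad = g * w[l]                               # dL/da_l

# Reference: full-storage backward pass gives the same gradients.
acts_full = forward_segment(x, 0, L)
grad_ref = acts_full[-1].copy()
dw_ref = np.zeros(L)
for l in range(L - 1, -1, -1):
    pre = acts_full[l] * w[l]
    g = grad_ref * (1.0 - np.tanh(pre) ** 2)
    dw_ref[l] = np.sum(g * acts_full[l])
    grad_ref = g * w[l]
assert np.allclose(dw, dw_ref)
```

With k ≈ √L, only O(√L) checkpoints are held at once, at the cost of roughly one extra forward pass of recomputation, matching the table's memory/FLOPs trade-off.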
Gradient Accumulation
When batch sizes are limited by GPU memory, gradient accumulation simulates larger batches:
- Run mini-batch forward + backward, accumulate gradients (no optimizer step)
- Repeat K times
- After K accumulations, divide by K and take optimizer step
This is equivalent to training with K× larger effective batch size, which is important for matching training dynamics in scaling experiments.
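For a loss averaged within each mini-batch, the equivalence can be checked directly. The sketch below (linear model with MSE; batch sizes and names are illustrative) accumulates gradients over K micro-batches without stepping, divides by K, and compares against the gradient of the full batch:

```python
import numpy as np

rng = np.random.default_rng(3)

K, micro, d = 4, 8, 5                 # K micro-batches of size `micro`
X = rng.normal(size=(K * micro, d))
t = rng.normal(size=K * micro)
w = rng.normal(size=d)

def grad_mse(Xb, tb, w):
    """Gradient of the mean squared error 0.5*mean((Xb w - tb)^2) w.r.t. w."""
    r = Xb @ w - tb
    return Xb.T @ r / len(tb)

# Accumulate per-micro-batch gradients; no optimizer step inside the loop.
acc = np.zeros(d)
for i in range(K):
    Xb = X[i * micro:(i + 1) * micro]
    tb = t[i * micro:(i + 1) * micro]
    acc += grad_mse(Xb, tb, w)
acc /= K                              # average over the K accumulations

full = grad_mse(X, t, w)              # gradient of the full batch at once
assert np.allclose(acc, full)
```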
Related Pages
See gradient-descent for how computed gradients are used in parameter updates, compute-flops for the total FLOPs accounting including the backward pass, and neural-network-fundamentals for the mathematical foundations.
Sources
- Rumelhart, Hinton & Williams (1986) — Learning representations by back-propagating errors. Nature
- Baydin et al. (2018) — Automatic Differentiation in Machine Learning: a Survey. JMLR
- Goodfellow et al. (2016) — Deep Learning. MIT Press (Chapter 6)
Frequently Asked Questions
Why is reverse-mode automatic differentiation (backprop) more efficient than forward mode for neural networks?
Forward-mode AD computes derivatives of all outputs with respect to one input — efficient when there are few inputs and many outputs. Reverse-mode AD computes derivatives of one scalar output (the loss L) with respect to all inputs — efficient when there are many inputs (all model parameters) and one output. Neural networks have millions of parameters but only one loss scalar, making reverse-mode O(N) vs forward-mode O(N²) for computing all N gradients.
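The "one direction per pass" nature of forward mode can be shown with a hand-rolled tangent propagation (dual numbers). The sketch below (a toy scalar loss; function and names are illustrative) needs one forward-mode pass per basis direction to recover the full gradient, which is exactly why reverse mode wins when parameters outnumber outputs:

```python
import numpy as np

rng = np.random.default_rng(4)

w = rng.normal(size=3)                 # 3 "parameters" (illustrative)
x = rng.normal(size=3)

def loss(w):
    return np.sum(np.tanh(w * x)) ** 2

def forward_mode(w, direction):
    """Propagate (value, tangent) pairs through the computation."""
    v, dv = w * x, direction * x                      # primal and tangent
    v, dv = np.tanh(v), (1 - np.tanh(w * x) ** 2) * dv
    s, ds = np.sum(v), np.sum(dv)
    return s ** 2, 2 * s * ds                         # loss and its tangent

# N forward passes, one per basis direction, to recover the full gradient.
grad_fwd = np.array([forward_mode(w, e)[1] for e in np.eye(3)])

# Finite-difference check of all three components.
eps = 1e-6
grad_fd = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(grad_fwd, grad_fd, atol=1e-5)
```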
What is the memory cost of backpropagation?
Backpropagation requires storing all intermediate activations from the forward pass — these are needed to compute gradients during the backward pass. For a transformer with L layers, sequence length n, and model dimension d, activation memory is O(n·d·L). For large models at long sequence lengths, this can exceed weight memory. Gradient checkpointing (Chen et al., 2016) trades compute for memory: checkpoint activations only every ~√L layers and recompute the rest during the backward pass, reducing activation memory to O(√L · n · d) at the cost of roughly one extra forward pass.
How does backpropagation through the attention mechanism work?
The attention computation Attention(Q,K,V) = softmax(Q·Kᵀ/√d_k)·V is differentiable. During backward, gradients flow through the value weighting (splitting between the attention weights and V), through the softmax (via its Jacobian), and through the matrix multiply Q·Kᵀ (splitting between Q and K). PyTorch and JAX derive these gradients automatically through their autodiff engines; fused kernels such as FlashAttention implement the backward pass by recomputing attention blocks on the fly, avoiding storage of the full n×n attention matrix for memory efficiency.
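The gradient flow described above can be written out by hand for the query matrix. The sketch below (tiny shapes, arbitrary upstream gradient; all sizes are illustrative) chains through the V-weighting, the row-wise softmax, and the scaled Q·Kᵀ product, then checks one entry of ∂L/∂Q against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(5)

n, d_k = 4, 3                        # illustrative sequence length and head dim
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))
G = rng.normal(size=(n, d_k))        # arbitrary upstream gradient dL/dOut

def attn(Q):
    S = Q @ K.T / np.sqrt(d_k)
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

def scalar_loss(Q):
    return np.sum(G * attn(Q))       # L = <G, Attention(Q, K, V)>

# Analytic gradient via the chain rule.
S = Q @ K.T / np.sqrt(d_k)
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)
dA = G @ V.T                                            # through A @ V
dS = A * (dA - np.sum(dA * A, axis=1, keepdims=True))   # row-wise softmax
dQ = dS @ K / np.sqrt(d_k)                              # through Q K^T / sqrt(d_k)

# Finite-difference check on one entry of Q.
eps = 1e-6
Qp, Qm = Q.copy(), Q.copy()
Qp[1, 2] += eps
Qm[1, 2] -= eps
numeric = (scalar_loss(Qp) - scalar_loss(Qm)) / (2 * eps)
assert abs(numeric - dQ[1, 2]) < 1e-5
```

Note that the analytic path needs the attention matrix A; FlashAttention avoids materializing it by recomputing blocks of A during the backward pass.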