Gradient Descent and Adam Optimizer: Update Rules and Hyperparameters

Category: training · Updated: 2026-02-27

SGD updates θ ← θ − η∇L; Adam adapts per-parameter learning rates using m_t = β₁m_{t-1}+(1−β₁)g_t and v_t = β₂v_{t-1}+(1−β₂)g_t²; typical transformer settings: β₁=0.9, β₂=0.95–0.999, ε=1e-8 (Kingma & Ba, 2015).

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| SGD update rule | θ_{t+1} = θ_t − η · ∇_θ L | — | η = learning rate; ∇_θ L = gradient of loss w.r.t. parameters |
| Adam first moment (β₁) | 0.9 | — | Exponential moving average of gradients; controls gradient smoothing |
| Adam second moment (β₂) | 0.999 | — | Exponential moving average of squared gradients; controls adaptive scaling |
| Adam epsilon (ε) | 1e-8 | — | Numerical stability term; prevents division by zero when v_t ≈ 0 |
| Transformer warmup steps (original) | 4,000 | steps | lr = d_model^{-0.5} · min(step^{-0.5}, step · warmup_steps^{-1.5}) |
| Typical peak learning rate (large models) | 1e-4 to 3e-4 | — | With warmup + cosine decay; exact value scales inversely with model size |

Gradient descent is the fundamental optimization algorithm for training neural networks. Starting from random weights, each iteration computes the gradient of the loss with respect to all parameters and takes a step in the negative gradient direction. Modern language model training universally uses variants of Adam, which adapts the learning rate per parameter using gradient history.
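The basic loop can be sketched on a toy one-dimensional objective (an assumed example, not from the text): minimize L(θ) = (θ − 3)², whose gradient is 2(θ − 3).

```python
# Minimal sketch of vanilla gradient descent on L(theta) = (theta - 3)^2.
def sgd_minimize(theta=0.0, lr=0.1, steps=100):
    for _ in range(steps):
        grad = 2.0 * (theta - 3.0)   # analytic gradient dL/dtheta
        theta = theta - lr * grad    # update rule: theta <- theta - lr * grad
    return theta

print(sgd_minimize())  # converges toward the minimizer theta* = 3
```

Because the contraction factor per step is (1 − 2η), a learning rate above 1.0 would diverge here, illustrating why η requires tuning even in the simplest case.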

Gradient Descent Variants

| Algorithm | Update Rule | Key Property |
|---|---|---|
| Vanilla SGD | θ ← θ − η·g | Simple; requires careful tuning |
| SGD + Momentum | v ← μv − η·g; θ ← θ + v | Accelerates in consistent directions |
| AdaGrad | θ ← θ − η·g/(√G + ε) | Adapts to rare features; learning rate never increases |
| RMSProp | θ ← θ − η·g/(√EMA[g²] + ε) | Decaying average of squared gradients |
| Adam | See below | Combines momentum + RMSProp; dominant in practice |
| AdamW | Adam + decoupled weight decay | Corrects L2 regularization in Adam |

Adam Update Rules (Kingma & Ba, 2015)

At step t, with gradient g_t = ∇_θ L:

  1. First moment: m_t = β₁·m_{t-1} + (1−β₁)·g_t
  2. Second moment: v_t = β₂·v_{t-1} + (1−β₂)·g_t²
  3. Bias correction: m̂_t = m_t/(1−β₁ᵗ), v̂_t = v_t/(1−β₂ᵗ)
  4. Update: θ_{t+1} = θ_t − η · m̂_t / (√v̂_t + ε)
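The four steps above can be written directly for a single scalar parameter (a minimal sketch; real implementations vectorize this over all parameters):

```python
import math

def adam_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (steps 1-4 above). t is 1-indexed, as in Kingma & Ba (2015)."""
    m = b1 * m + (1 - b1) * grad          # 1. first moment: EMA of gradients
    v = b2 * v + (1 - b2) * grad * grad   # 2. second moment: EMA of squared gradients
    m_hat = m / (1 - b1 ** t)             # 3. bias correction (moments start at 0)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)  # 4. parameter update
    return theta, m, v
```

At t = 1 the bias correction makes m̂ = g and v̂ = g² exactly, so the first step has magnitude ≈ η regardless of the gradient's scale — one concrete view of Adam's adaptive normalization.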

Hyperparameter Defaults

| Hyperparameter | Adam Default | Typical LLM Training |
|---|---|---|
| β₁ | 0.9 | 0.9 |
| β₂ | 0.999 | 0.95–0.999 |
| ε | 1e-8 | 1e-8 |
| Weight decay λ (AdamW) | 0 | 0.01–0.1 |
| Peak learning rate | 1e-3 (paper default η) | 1e-4 to 3e-4 |
| Gradient clipping | None | Norm ≤ 1.0 |

Gradient Clipping

Gradient clipping prevents exploding gradients by rescaling the gradient vector when its norm exceeds a threshold:

if ‖g‖₂ > max_norm: g ← g × (max_norm / ‖g‖₂)

Large language model training almost universally applies gradient clipping, typically with max_norm = 1.0. Without clipping, occasional large gradient spikes (common in attention layers) can destabilize training.
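The rescaling rule above can be sketched over a flat list of gradient values (real frameworks apply the same global-norm logic across all parameter tensors):

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm          # shrink so the clipped norm equals max_norm
        grads = [g * scale for g in grads]
    return grads
```

Clipping preserves the gradient's direction and only caps its magnitude, e.g. `[3.0, 4.0]` (norm 5) becomes `[0.6, 0.8]` (norm 1).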

Learning Rate Schedule (Transformer)

Vaswani et al. (2017) define:

lr(step) = d_model^{−0.5} · min(step^{−0.5}, step · 4000^{−1.5})

| Step | Learning Rate (d_model = 512) |
|---|---|
| 1 | ~1.7 × 10⁻⁷ |
| 1,000 | ~1.7 × 10⁻⁴ |
| 4,000 (peak) | ~7.0 × 10⁻⁴ |
| 10,000 | ~4.4 × 10⁻⁴ |
| 100,000 | ~1.4 × 10⁻⁴ |
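The schedule is a one-line function of the step count, which makes it easy to check values like those tabulated above:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Warmup-then-decay schedule from Vaswani et al. (2017).
    Linear warmup for warmup_steps, then inverse-sqrt decay."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

With d_model = 512 and 4,000 warmup steps, the peak learning rate at step 4,000 works out to about 7.0 × 10⁻⁴.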

See backpropagation for how gradients are computed, and neural-network-fundamentals for the broader optimization landscape.



Frequently Asked Questions

Why does Adam outperform vanilla SGD for transformer training?

Transformers have parameters across very different scales — embedding weights, attention projections, and FFN weights all have different gradient magnitudes. Adam's per-parameter adaptive learning rates normalize these differences: parameters with consistently large gradients get smaller effective learning rates, while parameters with small gradients get larger effective steps. This adaptive scaling makes Adam much more robust to the choice of global learning rate and is why it dominates in language model training.

What is AdamW and why is it used instead of Adam?

Original Adam implements L2 regularization by adding λθ to the gradient, which interacts with the adaptive learning rate scaling in an undesirable way — parameters updated infrequently get less regularization. Decoupled weight decay (AdamW, Loshchilov & Hutter 2019) applies weight decay directly to the parameters: θ_{t+1} = (1 − ηλ)θ_t − η·Adam_update. This correctly decouples weight decay from the gradient-based update, improving generalization. Most large model training uses AdamW.
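The decoupling can be sketched in one line (an illustrative helper, not a library API): the decay term multiplies θ directly and never passes through the adaptive √v̂ scaling.

```python
def adamw_update(theta, adam_delta, lr=1e-3, weight_decay=0.01):
    """Decoupled weight decay: shrink theta directly, then apply the Adam step.
    adam_delta is m_hat / (sqrt(v_hat) + eps) from the unregularized gradient."""
    return (1 - lr * weight_decay) * theta - lr * adam_delta
```

In coupled L2 regularization, λθ would instead be added to the gradient before the moment estimates, so its effect would be divided by √v̂; here every parameter decays at the same rate ηλ per step.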

What is the transformer learning rate schedule?

Vaswani et al. (2017) introduced a warmup-then-decay schedule: lr(step) = d_model^{-0.5} × min(step^{-0.5}, step × warmup_steps^{-1.5}). This linearly increases the learning rate for the first warmup_steps, then decreases proportionally to the inverse square root of the step count. The 4,000-step warmup in the original paper is roughly 4% of training — modern practice varies from 1% to 10% of total steps depending on model size and batch size.
