Softmax Function: Formula, Temperature Scaling, and Numerical Stability

Category: architecture · Updated: 2026-02-27

Softmax σ(z_i) = e^{z_i}/Σ_j e^{z_j} converts attention logits into a probability distribution; temperature T < 1 sharpens the distribution (T → 0 yields argmax/greedy), while T → ∞ flattens it toward uniform; computation is numerically stabilized by subtracting max(z) from the logits.

Key Data Points
| Measure | Value | Notes |
|---|---|---|
| Softmax formula | σ(z_i) = e^{z_i} / Σ_j e^{z_j} | Input z ∈ ℝ^K; output is a probability vector summing to 1 |
| Temperature-scaled softmax | σ(z_i/T) | T = 1 standard; T → 0 argmax (greedy); T → ∞ uniform distribution |
| Numerically stable computation | σ(z_i − max(z)) | Subtracting max(z) prevents overflow; does not change the output value |
| Gradient (diagonal of Jacobian) | ∂σ_i/∂z_i = σ_i(1 − σ_i) | Derivative w.r.t. own input; off-diagonal: ∂σ_i/∂z_j = −σ_i·σ_j for i ≠ j |
| Attention softmax input scale (d_k = 64) | ÷8 (÷√d_k) | Dividing by √d_k = 8 prevents large logits that saturate softmax gradients |

The softmax function maps a vector of arbitrary real numbers (logits) to a probability distribution over K categories. It is ubiquitous in language models: converting raw attention scores to attention weights, converting final hidden states to next-token probability distributions, and controlling sampling behavior via temperature.

The Formula

For a vector z = (z₁, z₂, …, z_K):

σ(z)_i = e^{z_i} / Σⱼ₌₁ᴷ e^{z_j}

Properties:

  • Each output is in (0, 1) — strictly positive
  • Outputs sum to exactly 1.0
  • Order-preserving: if z_i > z_j, then σ(z)_i > σ(z)_j
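These properties can be checked with a short sketch, assuming NumPy (the `softmax` helper name is illustrative, not from any particular library):

```python
import numpy as np

def softmax(z):
    """Map a real-valued logit vector to a probability distribution."""
    e = np.exp(z - np.max(z))  # max-subtraction keeps every exponent <= 0
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
# Every entry lies in (0, 1), the entries sum to 1,
# and the ordering of the logits is preserved in the outputs.
```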

Numerical Stability

Naive computation overflows for large logits. The numerically stable equivalent uses the identity:

σ(z)_i = e^{z_i − c} / Σⱼ e^{z_j − c}

where c = max(z). Subtracting c sets the maximum exponent to e⁰ = 1, preventing overflow.

| z_max | e^{z_max} | Stable? |
|---|---|---|
| 10 | 22,026 | Yes (float32 OK) |
| 50 | 5.18 × 10²¹ | Yes (float32 OK) |
| 88 | 1.65 × 10³⁸ | Borderline |
| 100 | 2.69 × 10⁴³ | Overflow → NaN |
| 100 (stabilized) | e⁰ = 1 | Always stable |
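The overflow behavior above can be reproduced directly in float32, assuming NumPy (variable names are illustrative):

```python
import numpy as np

z = np.array([100.0, 99.0, 98.0], dtype=np.float32)

# Naive: e^100 exceeds float32's maximum (~3.4e38), so the
# exponentials become inf and inf/inf yields NaN.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(z) / np.exp(z).sum()

# Stable: shift by max(z) so the largest exponent is e^0 = 1.
shifted = np.exp(z - z.max())
stable = shifted / shifted.sum()

print(naive)   # [nan nan nan]
print(stable)  # finite probabilities summing to 1
```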

Temperature Scaling

| Temperature | Effect | Use Case |
|---|---|---|
| T = 0 | Argmax (greedy) | Deterministic decoding |
| T = 0.7 | Sharpened | High-confidence outputs |
| T = 1.0 | Original distribution | Standard sampling |
| T = 1.5 | Flattened | More diverse/creative outputs |
| T → ∞ | Uniform distribution | Maximum randomness |
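A minimal sketch of temperature scaling, assuming NumPy (the `softmax_with_temperature` helper is hypothetical, not a library function):

```python
import numpy as np

def softmax_with_temperature(z, T):
    """Temperature-scaled softmax: sigma(z_i / T)."""
    z = np.asarray(z, dtype=np.float64) / T
    e = np.exp(z - z.max())  # stable form
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
for T in (0.5, 1.0, 2.0):
    # As T rises, probability mass spreads out; as T falls,
    # it concentrates on the largest logit.
    print(T, softmax_with_temperature(logits, T).round(3))
```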

Softmax in Attention

In self-attention, the score matrix Q·Kᵀ is divided by √d_k before softmax. For d_k=64, this scale factor is 8. Without it, large logits push the softmax into its saturation region where gradients are near zero.

| d_k | √d_k (scale) | Logit variance before scaling | After scaling |
|---|---|---|---|
| 16 | 4 | d_k = 16 | 1 |
| 64 | 8 | d_k = 64 | 1 |
| 256 | 16 | d_k = 256 | 1 |
| 512 | ≈22.6 | d_k = 512 | 1 |
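A toy single-head, unbatched sketch of this scaling, assuming NumPy (function and variable names are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D Q, K, V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # e.g. divide by 8 when d_k = 64
    scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 64))
K = rng.standard_normal((4, 64))
V = rng.standard_normal((4, 64))
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 64)
```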

Gradient Properties

For a single output σ_i with respect to input z_j:

  • Self-gradient: ∂σ_i/∂z_i = σ_i(1 − σ_i) — maximum at σ_i = 0.5
  • Cross-gradient: ∂σ_i/∂z_j = −σ_i·σ_j for i≠j

The Jacobian is dense: each output depends on all inputs. This dense coupling is essential for attention — a high score for one key reduces attention on all others.
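Both gradient formulas fall out of the matrix identity J = diag(σ) − σσᵀ, sketched here assuming NumPy (helper names illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    """J[i, j] = d sigma_i / d z_j = sigma_i * (delta_ij - sigma_j)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 2.0, 0.5])
J = softmax_jacobian(z)
# Diagonal entries equal sigma_i * (1 - sigma_i); off-diagonal
# entries equal -sigma_i * sigma_j; every row sums to zero.
```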

See self-attention-mechanism for where softmax appears in attention computation, and temperature-sampling for how temperature controls output diversity during inference.


Frequently Asked Questions

Why is the softmax numerically unstable without the max subtraction?

For large z_i, e^{z_i} can exceed float32's maximum (≈3.4×10³⁸ at z≈88). If any component overflows to infinity, the division produces NaN. Subtracting max(z) from all components before exponentiation keeps the largest exponent at e^0=1, guaranteeing no overflow while preserving the output distribution identically.

How does temperature affect softmax in language models?

Temperature T scales the logits before softmax: σ(z_i/T). At T=1 (default), the model's trained distribution is used. T<1 sharpens the distribution — at T=0, it becomes argmax (always picks the highest-probability token). T>1 flattens it toward uniform, increasing randomness and diversity. Most practical inference uses T between 0.7 and 1.2.

Why does attention use softmax specifically?

Attention requires a probability distribution over key positions — weights that are non-negative and sum to 1. Softmax is the standard way to achieve this from arbitrary real-valued scores. The exponential function ensures all weights are strictly positive, and the normalization ensures they sum to exactly 1.0. Alternative normalizations (sigmoid, sparsemax) have been explored but softmax remains dominant.
