Scaled Dot-Product Attention: Formula, Complexity, and the √d_k Scaling Factor

Category: architecture · Updated: 2026-02-27

Scaled dot-product attention computes softmax(Q·Kᵀ/√d_k)·V, dividing the dot products by √d_k (√64 = 8 in the base model) to keep the softmax out of its near-zero-gradient regime; its time complexity is O(n²·d), quadratic in sequence length n (Vaswani et al., 2017).

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Attention formula | softmax(Q·Kᵀ/√d_k)·V | | Scaled dot-product; scaling by √d_k prevents softmax saturation |
| d_k (key/query dimension) | 64 | dimensions | d_model/h = 512/8 = 64; √64 = 8 is the scaling divisor |
| Time complexity | O(n²·d) | | Quadratic in sequence length n; bottleneck for long sequences |
| Memory complexity | O(n²) | | Attention matrix is n×n; a 1024-token sequence needs ~4 MB per head at float32 |
| Sequential operations | O(1) | | All token interactions computed in parallel, vs O(n) for RNNs |
| Max path length (signal) | O(1) | | Any token pair connected in one operation; RNNs require O(n) steps |

Scaled dot-product attention is the core computational primitive of the transformer. Introduced by Vaswani et al. in “Attention Is All You Need” (NeurIPS 2017), it computes a weighted combination of value vectors, where the weights come from the compatibility between query and key vectors.

The Formula

The attention function maps a set of queries Q, keys K, and values V to an output:

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V

The queries, keys, and values are matrices: Q and K have dimension n×d_k, and V has dimension n×d_v. In the base transformer model, d_k = d_v = 64 and d_model = 512.

The factor √d_k = 8 is the scaling divisor. Without it, the dot products Q·Kᵀ grow large in magnitude as d_k increases, driving the softmax into near-zero-gradient saturation.
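
As a concrete reference, here is a minimal NumPy sketch of the formula for a single unbatched head (an illustrative implementation, not the paper's reference code), using the per-head shapes above:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) compatibility matrix
    scores -= scores.max(axis=-1, keepdims=True)    # stability: softmax is shift-invariant
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n, d_v) weighted values

n, d_k, d_v = 512, 64, 64                           # base-model sizes per head
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for d in (d_k, d_k, d_v))
out = scaled_dot_product_attention(Q, K, V)         # shape (512, 64)
```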

Complexity Comparison

| Operation | Time Complexity | Memory Complexity | Sequential Ops | Max Path Length |
|---|---|---|---|---|
| Self-Attention | O(n²·d) | O(n²) | O(1) | O(1) |
| Recurrent (RNN) | O(n·d²) | O(d) | O(n) | O(n) |
| Convolutional (kernel k) | O(k·n·d²) | O(k·n·d) | O(1) | O(log_k n) |
| Self-Attention (restricted, window r) | O(r·n·d) | O(r·n) | O(1) | O(n/r) |

The quadratic memory in n is the defining constraint: a 4096-token sequence requires 256× more memory than a 256-token sequence in the attention matrix alone.
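
A quick arithmetic check of that claim, as a sketch (float32 scores, 4 bytes each; sizes reported in binary megabytes):

```python
def attn_matrix_mib(n, bytes_per_score=4):
    """Memory for one head's n×n float32 attention matrix, in MiB."""
    return n * n * bytes_per_score / 2**20

for n in (256, 1024, 4096):
    print(f"n={n:5d}: {attn_matrix_mib(n):6.2f} MiB per head")
# n=  256:   0.25 MiB per head
# n= 1024:   4.00 MiB per head
# n= 4096:  64.00 MiB per head   (256× the 256-token case)
```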

Why Dot Products Scale with √d_k

Consider query and key vectors q and k whose d_k components are independent random variables with mean 0 and variance 1. Their dot product q·k = Σ q_i·k_i is a sum of d_k uncorrelated mean-0 terms, so it has mean 0 and variance d_k. The standard deviation, and thus the typical scale of the input to softmax, therefore grows as √d_k. Dividing by √d_k normalizes this variance back to 1, keeping gradients well-behaved throughout training.
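
A short simulation of this argument (a sketch, with standard-normal vectors standing in for queries and keys):

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256):
    q = rng.standard_normal((20_000, d_k))   # components: mean 0, variance 1
    k = rng.standard_normal((20_000, d_k))
    dots = (q * k).sum(axis=1)               # 20k sampled dot products
    print(f"d_k={d_k:3d}  var(q·k)={dots.var():6.1f}  "
          f"var(q·k/√d_k)={(dots / np.sqrt(d_k)).var():.2f}")
# var(q·k) comes out ≈ 16, 64, 256, tracking d_k;
# after dividing by √d_k the variance stays ≈ 1.00.
```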

Computation Breakdown

For the base model’s multi-head attention block with n=512 tokens, h=8 heads, d_k=64:

| Step | Shape | Operations |
|---|---|---|
| Q = X·W_Q | (512, 64) | 512×512×64 ≈ 16.8M multiplications |
| K = X·W_K | (512, 64) | 512×512×64 ≈ 16.8M multiplications |
| V = X·W_V | (512, 64) | 512×512×64 ≈ 16.8M multiplications |
| Attention scores Q·Kᵀ | (512, 512) | 512×512×64 ≈ 16.8M multiplications |
| Softmax + V weighting | (512, 64) | 512×512×64 ≈ 16.8M multiplications |

Each head performs roughly 84M multiply-accumulate operations across these five steps; over 8 heads, that is ~671M per attention sublayer, excluding the shared output projection W_O.
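
The same count as a quick sketch in code (variable names are descriptive, not from any reference implementation):

```python
n, d_model, d_k, h = 512, 512, 64, 8      # base-model hyperparameters

proj = n * d_model * d_k                  # one X·W projection: ≈16.8M mults
scores = n * n * d_k                      # Q·Kᵀ: ≈16.8M
weighting = n * n * d_k                   # softmax(·)·V: ≈16.8M (d_v = d_k)

per_head = 3 * proj + scores + weighting  # Q, K, V projections + attention
print(f"per head: {per_head / 1e6:.1f}M, "
      f"all {h} heads: {h * per_head / 1e6:.0f}M")
# per head: 83.9M, all 8 heads: 671M
```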

Dot-Product vs Additive Attention

The dot-product approach is faster and more space-efficient than the additive attention mechanism (Bahdanau et al., 2015), which scores each query–key pair with a small feed-forward network and a learnable parameter vector v: score(q, k) = vᵀ·tanh(W_1·q + W_2·k). Although the two are similar in theoretical complexity, in practice at large d_k the scaled dot-product formulation is faster because it reduces to highly optimized matrix multiplication routines.
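
For contrast, a sketch of the additive scorer; the names W1, W2, v and the hidden size d_a are illustrative choices, not Bahdanau et al.'s exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k, d_a = 128, 64, 64          # d_a: scorer hidden size (an assumed value)
W1, W2 = rng.standard_normal((2, d_a, d_k))
v = rng.standard_normal(d_a)

def additive_scores(Q, K):
    """Bahdanau-style scorer: score(q, k) = vᵀ·tanh(W1·q + W2·k) for all pairs."""
    # Broadcast projected queries and keys to an (n, n, d_a) grid of hidden
    # states, then reduce each hidden vector to a scalar score with v.
    hidden = np.tanh((Q @ W1.T)[:, None, :] + (K @ W2.T)[None, :, :])
    return hidden @ v              # (n, n) score matrix

Q, K = rng.standard_normal((2, n, d_k))
scores = additive_scores(Q, K)     # same role as Q·Kᵀ/√d_k above
```

Note that the broadcast materializes an (n, n, d_a) tensor of hidden activations, whereas dot-product attention produces its (n, n) score matrix in a single optimized matrix multiply; that gap is the practical speed and memory advantage described above.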

For extensions that reduce the O(n²) cost, see encoder-decoder-architecture; for the multi-head wrapper around this function, see multi-head-attention. The positional-encoding page explains how token positions are injected before attention is computed.



Sources

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017.
Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015.

Frequently Asked Questions

Why divide by √d_k in scaled dot-product attention?

For large d_k, the magnitudes of the dot products in Q·Kᵀ grow in proportion to √d_k, pushing the softmax into regions of very small gradients. Dividing by √d_k keeps the inputs to softmax in a moderate range regardless of dimensionality, preventing near-zero gradients during training.

Why is self-attention O(n²) and why does that matter?

Computing the full n×n attention matrix between all pairs of tokens requires O(n²) operations and O(n²) memory. For a sequence of 1024 tokens, the attention matrix alone consumes roughly 4 MB per head at float32 (1024 × 1024 × 4 bytes). This quadratic scaling is the primary constraint on context length for standard transformers.

How does self-attention compare to RNNs for long-range dependencies?

Self-attention connects any two tokens in O(1) operations regardless of their distance in the sequence. Recurrent networks require O(n) sequential steps for a signal to travel between distant tokens, making them vulnerable to vanishing gradients over long dependencies. This is one of the key advantages of the transformer.
