Multi-Head Attention: Projection Matrices, Parameter Count, and Head Ablations
Multi-head attention uses h=8 heads with d_k=64 each; the base transformer's attention block contains ~1.05M parameters; ablations show 8 heads achieves 25.8 BLEU vs 24.9 for a single head (Vaswani et al., 2017).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Number of heads (base model) | 8 | heads | d_k = d_v = d_model/h = 512/8 = 64 |
| d_k per head | 64 | dimensions | Each of the 8 heads projects into a 64-dimensional subspace |
| Parameters per W_Q / W_K / W_V | 512 × 64 = 32,768 | parameters | Per head; all 8 heads together: 3 × 8 × 32,768 = 786,432 parameters |
| W_O projection parameters | 512 × 512 = 262,144 | parameters | Final output projection; maps concatenated 512-dim back to d_model=512 |
| Total attention block parameters | 1,048,576 | parameters | 786,432 (input projections) + 262,144 (output projection) = 1,048,576 exactly |
| BLEU with 1 head vs 8 heads | 24.9 vs 25.8 | BLEU | WMT EN-DE; single-head is 0.9 BLEU worse; ablation from Table 3 row A |
| BLEU with 16 heads (d_k=32) | 25.8 | BLEU | Matches the 8-head result; quality only drops at 32 heads (25.4) |
Multi-head attention wraps the scaled dot-product attention mechanism by running h parallel attention functions on learned linear projections of the inputs, then concatenating and reprojecting the results. Proposed by Vaswani et al. in “Attention Is All You Need” (NeurIPS 2017), it allows the model to attend simultaneously to information from different representation subspaces.
The Formula
MultiHead(Q, K, V) = Concat(head₁, …, headₕ) · W_O
where headᵢ = Attention(Q·W_Qᵢ, K·W_Kᵢ, V·W_Vᵢ)
The projection matrices for each head i are:
- W_Qᵢ ∈ ℝ^{d_model × d_k} = ℝ^{512 × 64}
- W_Kᵢ ∈ ℝ^{d_model × d_k} = ℝ^{512 × 64}
- W_Vᵢ ∈ ℝ^{d_model × d_v} = ℝ^{512 × 64}
- W_O ∈ ℝ^{h·d_v × d_model} = ℝ^{512 × 512}
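The formula and projection shapes above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation: the weight names (`W_q`, `W_k`, `W_v`, `W_o`) and the random initialization are assumptions for the example, and the shapes follow the base model (d_model=512, h=8, d_k=d_v=64).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, h=8):
    d_model = Q.shape[-1]
    d_k = d_model // h
    heads = []
    for i in range(h):
        # Per-head projections into the d_k-dimensional subspace
        q = Q @ W_q[i]                        # (seq, d_k)
        k = K @ W_k[i]
        v = V @ W_v[i]
        scores = q @ k.T / np.sqrt(d_k)       # scaled dot-product attention
        heads.append(softmax(scores) @ v)     # (seq, d_k)
    # Concat(head_1, ..., head_h) @ W_O maps back to d_model
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
d_model, h, d_k, seq = 512, 8, 64, 10
X = rng.standard_normal((seq, d_model))
W_q = rng.standard_normal((h, d_model, d_k)) * 0.02
W_k = rng.standard_normal((h, d_model, d_k)) * 0.02
W_v = rng.standard_normal((h, d_model, d_k)) * 0.02
W_o = rng.standard_normal((h * d_k, d_model)) * 0.02
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o, h=h)
print(out.shape)  # (10, 512)
```

A real implementation would batch the per-head loop into a single reshaped matrix multiply, but the loop form mirrors the headᵢ notation above.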
Parameter Count Per Attention Block
| Component | Shape | Parameters |
|---|---|---|
| W_Q (all heads) | 8 × (512 × 64) | 262,144 |
| W_K (all heads) | 8 × (512 × 64) | 262,144 |
| W_V (all heads) | 8 × (512 × 64) | 262,144 |
| W_O (output proj) | 512 × 512 | 262,144 |
| Total | — | 1,048,576 |
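The arithmetic in the table above can be verified in a few lines. The variable names here are illustrative; the figures are for the base model (d_model=512, h=8, d_k=64):

```python
# Parameter count for one multi-head attention block (base model)
d_model, h, d_k = 512, 8, 64

per_matrix = d_model * d_k          # one projection matrix, one head: 32,768
qkv = 3 * h * per_matrix            # W_Q, W_K, W_V across all heads: 786,432
w_o = (h * d_k) * d_model           # output projection 512 x 512: 262,144
total = qkv + w_o

print(per_matrix, qkv, w_o, total)  # 32768 786432 262144 1048576
```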
Base vs Big Model Head Configuration
| Hyperparameter | Base Model | Big Model |
|---|---|---|
| d_model | 512 | 1024 |
| Heads (h) | 8 | 16 |
| d_k = d_v | 64 | 64 |
| Encoder layers | 6 | 6 |
| Decoder layers | 6 | 6 |
| Dropout | 0.1 | 0.3 |
| Total parameters | 65M | 213M |
| WMT EN-DE BLEU | 27.3 | 28.4 |
Note: the big model uses 16 heads but keeps d_k=64 by doubling d_model to 1024. This means more distinct subspaces rather than wider projections per head.
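The invariant in the note above — per-head width stays fixed while head count scales with d_model — reduces to one line of arithmetic per configuration:

```python
# d_k = d_model // h equals 64 in both published configurations
configs = {"base": (512, 8), "big": (1024, 16)}
d_k_per_head = {name: d_model // h for name, (d_model, h) in configs.items()}
print(d_k_per_head)  # {'base': 64, 'big': 64}
```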
Head Count Ablation (Table 3, Row A — Vaswani et al.)
The following results hold d_model=512 fixed while varying the number of heads, keeping total computation constant by adjusting d_k accordingly:
| Heads | d_k | WMT EN-DE BLEU |
|---|---|---|
| 1 | 512 | 24.9 |
| 4 | 128 | 25.5 |
| 8 | 64 | 25.8 |
| 16 | 32 | 25.8 |
| 32 | 16 | 25.4 |
The 8-head configuration performs best per unit of per-head width. Single-head attention is 0.9 BLEU worse, and 16 heads (d_k=32) matches the 8-head score; quality only drops off at 32 heads, where d_k=16 dimensions per head may be too narrow for each head to learn a meaningful projection.
What Different Heads Learn
Research by Voita et al. (2019) at ACL found that in a trained model, most attention heads are prunable with minimal performance loss, but a small set of specialized heads perform distinct functions: positional heads attend to adjacent tokens, syntactic heads track specific grammatical dependencies, and rare-word heads focus on low-frequency tokens. This functional specialization is what multiple heads enable.
Related Pages
See self-attention-mechanism for the dot-product attention formula inside each head, transformer-architecture for how this block sits within the full model, and feed-forward-layers for the other major parameter block in each transformer layer.
Sources
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
- Voita et al. (2019) — Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting. ACL 2019
- Michel et al. (2019) — Are Sixteen Heads Really Better than One? NeurIPS 2019
Frequently Asked Questions
Why use multiple attention heads instead of one large attention operation?
Multiple heads allow the model to jointly attend to information from different representation subspaces at different positions. A single head averages all this information, losing the ability to specialize. With h=8 heads, each head can learn to track different syntactic or semantic relationships simultaneously.
How does multi-head attention keep the total computation constant?
Each head operates on d_k = d_model/h dimensions, so the per-head computation is reduced proportionally. Running h heads at d_k = 64 each involves the same total floating-point operations as a single head at d_k = 512, while enabling richer, parallel representations.
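This constant-computation property can be checked with a rough multiply-accumulate count. The `attn_flops` function below is an illustrative sketch (it counts only projections, QKᵀ scores, the attention-weighted sum, and W_O, ignoring softmax and constants):

```python
# Rough MAC count for one attention block, showing h heads at
# d_k = d_model // h cost the same as a single head at d_k = d_model.
def attn_flops(seq, d_model, h):
    d_k = d_model // h
    proj = 3 * seq * d_model * d_k * h   # Q, K, V projections, all heads
    scores = h * seq * seq * d_k         # Q @ K^T per head
    mix = h * seq * seq * d_k            # attention-weighted sum of V
    out = seq * (h * d_k) * d_model      # output projection W_O
    return proj + scores + mix + out

print(attn_flops(128, 512, 1) == attn_flops(128, 512, 8))  # True
```

Every term is linear in h·d_k = d_model, so splitting one wide head into many narrow ones leaves the total unchanged.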
What did ablation studies show about the optimal number of heads?
Vaswani et al. (2017) found in Table 3 that 8 heads achieves 25.8 BLEU on WMT EN-DE; single-head attention scores 24.9 BLEU (−0.9), 4 heads scores 25.5, 16 heads matches at 25.8, and 32 heads drops to 25.4. Performance degrades at both extremes: a single head cannot specialize, and very many heads with d_k=16 leave too little width per head, making 8–16 heads the practical optimum for the base model.