Multi-Head Attention: Projection Matrices, Parameter Count, and Head Ablations
Multi-head attention uses h=8 heads with d_k=64 each; the base transformer's attention block contains ~1.05M parameters; ablations show 8 heads achieves 25.8 BLEU vs 24.9 for a single head (Vaswani et al., 2017).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Number of heads (base model) | 8 | heads | d_k = d_v = d_model/h = 512/8 = 64 |
| d_k per head | 64 | dimensions | Each of the 8 heads projects into a 64-dimensional subspace |
| Parameters per W_Q / W_K / W_V | 512 × 64 = 32,768 | parameters | Per head; all 8 heads together: 3 × 8 × 32,768 = 786,432 parameters |
| W_O projection parameters | 512 × 512 = 262,144 | parameters | Final output projection; maps concatenated 512-dim back to d_model=512 |
| Total attention block parameters | 1,048,576 | parameters | 786,432 (input projections) + 262,144 (output projection) = 1,048,576 exactly |
| BLEU with 1 head vs 8 heads | 24.9 vs 25.8 | BLEU | WMT EN-DE; single-head is 0.9 BLEU worse; ablation from Table 3 row A |
| BLEU with 16 heads (d_k=32) | 25.8 | BLEU | Matches the 8-head result; quality only drops at 32 heads (25.4) |
Multi-head attention wraps the scaled dot-product attention mechanism by running h parallel attention functions on learned linear projections of the inputs, then concatenating and reprojecting the results. Proposed by Vaswani et al. in “Attention Is All You Need” (NeurIPS 2017), it allows the model to attend simultaneously to information from different representation subspaces.
The Formula
MultiHead(Q, K, V) = Concat(head₁, …, headₕ) · W_O
where headᵢ = Attention(Q·W_Qᵢ, K·W_Kᵢ, V·W_Vᵢ)
The projection matrices for each head i are:
- W_Qᵢ ∈ ℝ^{d_model × d_k} = ℝ^{512 × 64}
- W_Kᵢ ∈ ℝ^{d_model × d_k} = ℝ^{512 × 64}
- W_Vᵢ ∈ ℝ^{d_model × d_v} = ℝ^{512 × 64}
- W_O ∈ ℝ^{h·d_v × d_model} = ℝ^{512 × 512}
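The formula and projection shapes above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation: the weight names (`W_q`, `W_k`, `W_v`, `W_o`) and the random initialization are assumptions for the example, and the shapes follow the base model (d_model=512, h=8, d_k=d_v=64).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, h=8):
    d_model = Q.shape[-1]
    d_k = d_model // h
    heads = []
    for i in range(h):
        # Per-head projections into the d_k-dimensional subspace
        q = Q @ W_q[i]                        # (seq, d_k)
        k = K @ W_k[i]
        v = V @ W_v[i]
        scores = q @ k.T / np.sqrt(d_k)       # scaled dot-product attention
        heads.append(softmax(scores) @ v)     # (seq, d_k)
    # Concat(head_1, ..., head_h) @ W_O maps back to d_model
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
d_model, h, d_k, seq = 512, 8, 64, 10
X = rng.standard_normal((seq, d_model))
W_q = rng.standard_normal((h, d_model, d_k)) * 0.02
W_k = rng.standard_normal((h, d_model, d_k)) * 0.02
W_v = rng.standard_normal((h, d_model, d_k)) * 0.02
W_o = rng.standard_normal((h * d_k, d_model)) * 0.02
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o, h=h)
print(out.shape)  # (10, 512)
```

A real implementation would batch the per-head loop into a single reshaped matrix multiply, but the loop form mirrors the headᵢ notation above.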
Parameter Count Per Attention Block
| Component | Shape | Parameters |
|---|---|---|
| W_Q (all heads) | 8 × (512 × 64) | 262,144 |
| W_K (all heads) | 8 × (512 × 64) | 262,144 |
| W_V (all heads) | 8 × (512 × 64) | 262,144 |
| W_O (output proj) | 512 × 512 | 262,144 |
| Total | — | 1,048,576 |
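The arithmetic in the table above can be verified in a few lines. The variable names here are illustrative; the figures are for the base model (d_model=512, h=8, d_k=64):

```python
# Parameter count for one multi-head attention block (base model)
d_model, h, d_k = 512, 8, 64

per_matrix = d_model * d_k          # one projection matrix, one head: 32,768
qkv = 3 * h * per_matrix            # W_Q, W_K, W_V across all heads: 786,432
w_o = (h * d_k) * d_model           # output projection 512 x 512: 262,144
total = qkv + w_o

print(per_matrix, qkv, w_o, total)  # 32768 786432 262144 1048576
```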
Base vs Big Model Head Configuration
| Hyperparameter | Base Model | Big Model |
|---|---|---|
| d_model | 512 | 1024 |
| Heads (h) | 8 | 16 |
| d_k = d_v | 64 | 64 |
| Encoder layers | 6 | 6 |
| Decoder layers | 6 | 6 |
| Dropout | 0.1 | 0.3 |
| Total parameters | 65M | 213M |
| WMT EN-DE BLEU | 27.3 | 28.4 |
Note: the big model uses 16 heads but keeps d_k=64 by doubling d_model to 1024. This means more distinct subspaces rather than wider projections per head.
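The invariant in the note above — per-head width stays fixed while head count scales with d_model — reduces to one line of arithmetic per configuration:

```python
# d_k = d_model // h equals 64 in both published configurations
configs = {"base": (512, 8), "big": (1024, 16)}
d_k_per_head = {name: d_model // h for name, (d_model, h) in configs.items()}
print(d_k_per_head)  # {'base': 64, 'big': 64}
```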
Head Count Ablation (Table 3, Row A — Vaswani et al.)
The following results hold d_model=512 fixed while varying the number of heads, keeping total computation constant by adjusting d_k accordingly:
| Heads | d_k | WMT EN-DE BLEU |
|---|---|---|
| 1 | 512 | 24.9 |
| 4 | 128 | 25.5 |
| 8 | 64 | 25.8 |
| 16 | 32 | 25.8 |
| 32 | 16 | 25.4 |
The 8-head configuration performs best per unit of per-head width. Single-head attention is 0.9 BLEU worse, and 16 heads (d_k=32) matches the 8-head score; quality only drops off at 32 heads, where d_k=16 dimensions per head may be too narrow for each head to learn a meaningful projection.
What Different Heads Learn
Research by Voita et al. (2019) at ACL found that in a trained model, most attention heads are prunable with minimal performance loss, but a small set of specialized heads perform distinct functions: positional heads attend to adjacent tokens, syntactic heads track specific grammatical dependencies, and rare-word heads focus on low-frequency tokens. This functional specialization is what multiple heads enable.
Related Pages
See self-attention-mechanism for the dot-product attention formula inside each head, transformer-architecture for how this block sits within the full model, and feed-forward-layers for the other major parameter block in each transformer layer.
Sources
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
- Voita et al. (2019) — Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting. ACL 2019
- Michel et al. (2019) — Are Sixteen Heads Really Better than One? NeurIPS 2019
Frequently Asked Questions
Why use multiple attention heads instead of one large attention operation?
Multiple heads allow the model to jointly attend to information from different representation subspaces at different positions. A single head averages all this information, losing the ability to specialize. With h=8 heads, each head can learn to track different syntactic or semantic relationships simultaneously.
How does multi-head attention keep the total computation constant?
Each head operates on d_k = d_model/h dimensions, so the per-head computation is reduced proportionally. Running h heads at d_k = 64 each involves the same total floating-point operations as a single head at d_k = 512, while enabling richer, parallel representations.
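This constant-computation property can be checked with a rough multiply-accumulate count. The `attn_flops` function below is an illustrative sketch (it counts only projections, QKᵀ scores, the attention-weighted sum, and W_O, ignoring softmax and constants):

```python
# Rough MAC count for one attention block, showing h heads at
# d_k = d_model // h cost the same as a single head at d_k = d_model.
def attn_flops(seq, d_model, h):
    d_k = d_model // h
    proj = 3 * seq * d_model * d_k * h   # Q, K, V projections, all heads
    scores = h * seq * seq * d_k         # Q @ K^T per head
    mix = h * seq * seq * d_k            # attention-weighted sum of V
    out = seq * (h * d_k) * d_model      # output projection W_O
    return proj + scores + mix + out

print(attn_flops(128, 512, 1) == attn_flops(128, 512, 8))  # True
```

Every term is linear in h·d_k = d_model, so splitting one wide head into many narrow ones leaves the total unchanged.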
What did ablation studies show about the optimal number of heads?
Vaswani et al. (2017) found in Table 3 that 8 heads achieves 25.8 BLEU on WMT EN-DE; single-head attention scores 24.9 BLEU (−0.9), 4 heads scores 25.5, 16 heads matches at 25.8, and 32 heads drops to 25.4. Performance degrades at both extremes: a single head cannot specialize, and very many heads with d_k=16 leave too little width per head, making 8–16 heads the practical optimum for the base model.