Attention Is All You Need: The Transformer Paper — Key Results and Impact

Category: representation · Updated: 2026-02-27

Vaswani et al. (NeurIPS 2017) introduced the transformer architecture. The 213M-parameter big model achieved 28.4 BLEU on WMT EN-DE, surpassing all prior models including ensembles, while the 65M-parameter base model reached 27.3 BLEU after only 12 hours of training on 8 P100 GPUs.

Key Data Points
| Measure | Value | Unit | Notes |
| --- | --- | --- | --- |
| WMT EN-DE BLEU (transformer big) | 28.4 | BLEU | State of the art at publication; surpassed all prior single-model and ensemble results |
| WMT EN-FR BLEU (transformer big) | 41.8 | BLEU | Trained on 36M sentence pairs; outperformed all prior models |
| Base model training time | 12 | hours | 100K steps on 8 × NVIDIA P100 GPUs; big model trained for 3.5 days |
| Training cost (base model) | 3.3 × 10¹⁸ | FLOPs | Big model: 2.3 × 10¹⁹ FLOPs; dramatically less than prior LSTM-based systems |
| Previous SOTA (GNMT+RL ensemble) | 26.30 | BLEU (EN-DE) | Wu et al. (2016) Google NMT ensemble; the transformer single model exceeded this |
| Paper citation count | 130,000+ | citations | As of 2025; one of the most-cited machine learning papers |

“Attention Is All You Need” by Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin (NeurIPS 2017) introduced the transformer architecture and demonstrated that recurrence could be entirely eliminated from sequence modeling. The paper is foundational to all subsequent large language model development.

Core Contribution

Prior state-of-the-art neural machine translation used encoder-decoder architectures with LSTMs augmented by attention (Bahdanau et al., 2015; Wu et al., 2016). These models processed tokens sequentially — each hidden state depended on the previous — preventing training parallelization.

The transformer replaced all recurrent components with self-attention, enabling:

  • Full parallelization across sequence positions during training
  • Direct long-range connections between any two positions in O(1) steps
  • Significantly faster training — 3.5 days vs weeks for comparable LSTM systems
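The operation that replaces recurrence is the paper's scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A minimal NumPy sketch of a single head, with no masking or learned projections, shows why it parallelizes: every position attends to every other in one batched matrix multiply.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    All position pairs interact through one matrix multiply, so the
    whole sequence is processed in parallel; there is no sequential
    hidden-state chain as in an RNN.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq, seq) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (seq, d_v) mixed values

# Toy self-attention: 4 positions, width 8, with Q = K = V = x.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

In the full model, Q, K, and V are linear projections of the input, and this operation is repeated across h heads; the sketch keeps only the core formula.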

Benchmark Results (Table 2 — Vaswani et al.)

| Model | WMT EN-DE BLEU | WMT EN-FR BLEU | Training Cost (FLOPs, EN-DE) |
| --- | --- | --- | --- |
| GNMT+RL (ensemble, 2016) | 26.30 | 41.16 | ~10²⁰ |
| ConvS2S (ensemble, 2017) | 26.36 | 41.29 | n/a |
| Transformer (base) | 27.3 | 38.1 | 3.3 × 10¹⁸ |
| Transformer (big) | 28.4 | 41.8 | 2.3 × 10¹⁹ |

The transformer big model exceeded all prior ensembles as a single model with less total compute.

Ablation Study Results (Table 3 — Selected Rows)

| Configuration | BLEU (newstest2013 dev) | Notes |
| --- | --- | --- |
| Full base model (N=6, h=8, d_k=64) | 25.8 | Reference |
| Single head (h=1, d_k=512) | 24.9 | −0.9 BLEU |
| 16 heads (h=16, d_k=32) | 25.8 | Matches the base configuration |
| Learned positional embedding | 25.7 | Nearly identical to sinusoidal |
| No dropout | 24.6 | −1.2 BLEU |
| d_k = 16 (vs 64) | 25.1 | −0.7 BLEU |

Training Configuration

| Hyperparameter | Base Model | Big Model |
| --- | --- | --- |
| Optimizer | Adam | Adam |
| β₁ | 0.9 | 0.9 |
| β₂ | 0.98 | 0.98 |
| ε | 10⁻⁹ | 10⁻⁹ |
| Warmup steps | 4,000 | 4,000 |
| Learning rate formula | d_model^(−0.5) · min(step^(−0.5), step · warmup^(−1.5)) | same |
| Dropout | 0.1 | 0.3 |
| Label smoothing | 0.1 | 0.1 |
| Training steps | 100,000 | 300,000 |
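
The label smoothing value ε_ls = 0.1 softens the one-hot training targets; the paper notes this hurts perplexity but improves accuracy and BLEU. A minimal sketch, assuming one common formulation in which the ε mass is spread uniformly over the non-target classes (implementations vary in this detail):

```python
import numpy as np

def smoothed_targets(labels, vocab_size, eps=0.1):
    """Replace one-hot targets with a smoothed distribution: the true
    class gets 1 - eps, and eps is spread uniformly over the other
    vocab_size - 1 classes (assumed variant; others spread over all)."""
    targets = np.full((len(labels), vocab_size), eps / (vocab_size - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

# Vocabulary of 5, true class 2: true class gets 0.9, others 0.025 each.
t = smoothed_targets(np.array([2]), vocab_size=5)
```

Training then minimizes cross-entropy against these soft targets instead of the hard one-hot labels.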

The warmup schedule increases the learning rate linearly for the first warmup_steps, then decreases it proportionally to the inverse square root of step number — a specific choice validated by the authors.
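
The schedule from the formula above can be written directly; `transformer_lr` is a hypothetical helper name, not from the paper.

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup^(-1.5))."""
    step = max(step, 1)  # the formula is undefined at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Linear ramp up to the peak at step == warmup, then 1/sqrt(step) decay.
peak = transformer_lr(4000)  # ~7.0e-4 for d_model = 512
half = transformer_lr(2000)  # exactly half the peak, since the ramp is linear
```

Note that the peak learning rate is set implicitly by d_model and warmup rather than as a separate hyperparameter.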

See transformer-architecture for a detailed walkthrough of the model dimensions, scaling-laws for how the principles established here were extended to larger models, and pre-training for how the transformer paradigm was extended to self-supervised training.


Frequently Asked Questions

What was the key innovation of 'Attention Is All You Need'?

The paper eliminated recurrence and convolutions entirely, building a sequence-to-sequence model using only attention mechanisms. This enabled full parallelization during training (unlike RNNs which process tokens sequentially), dramatically reducing training time. The multi-head self-attention mechanism allowed each position to directly attend to all other positions in O(1) operations, solving the long-range dependency problem that plagued LSTMs.

How did the transformer compare to prior LSTM-based systems?

The transformer big model achieved 28.4 BLEU on WMT EN-DE, compared to the prior best ensemble model (GNMT+RL) at 26.30 BLEU — a 2.1 BLEU improvement. More significantly, it achieved this in 3.5 days of training (2.3×10¹⁹ FLOPs) whereas GNMT required weeks. The base transformer (27.3 BLEU, 12 hours, 3.3×10¹⁸ FLOPs) already outperformed most prior single models.

What architectural choices were validated in the paper's ablations?

Table 3 of the paper systematically ablated: number of attention heads (a single head costs 0.9 BLEU versus the 8-head base, and too many heads also degrades quality), key dimension d_k (reducing it hurts quality), dropout (removing it costs 1.2 BLEU), positional encoding type (learned embeddings performed nearly identically to sinusoidal), and model size. The ablations confirmed that multi-head attention and sufficient key dimensionality are deliberate design choices, not arbitrary hyperparameters.
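
The head-count ablations trade h against d_k while keeping h · d_k = d_model = 512 fixed, so total attention compute stays roughly constant. A small NumPy sketch of the reshape involved (illustrative only, not the paper's code):

```python
import numpy as np

def split_heads(x, h):
    """Reshape (seq, d_model) into (h, seq, d_k) with d_k = d_model // h.

    Each head attends in its own d_k-dimensional subspace; per-head
    width shrinks as h grows, keeping total compute roughly fixed."""
    seq, d_model = x.shape
    d_k = d_model // h
    return x.reshape(seq, h, d_k).transpose(1, 0, 2)

x = np.zeros((10, 512))
print(split_heads(x, 8).shape)  # (8, 10, 64)  -> the base-model setting
print(split_heads(x, 1).shape)  # (1, 10, 512) -> the single-head ablation
```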
