Transformer Architecture: Encoder-Decoder Design and Dimensions
The original transformer uses 6 encoder and 6 decoder layers, d_model=512, 8 attention heads, and roughly 65M parameters in its base configuration; trained on WMT 2014 English-German, the larger "big" variant achieved 28.4 BLEU (Vaswani et al., 2017).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Encoder layers | 6 | layers | Each layer has multi-head self-attention + feed-forward sublayers |
| Decoder layers | 6 | layers | Each layer has self-attention, cross-attention, and feed-forward sublayers |
| d_model (embedding dimension) | 512 | dimensions | Uniform across all sublayers for easy residual connections |
| Number of attention heads | 8 | heads | Each head operates on d_k = d_v = d_model/h = 64 dimensions |
| d_ff (feed-forward inner dimension) | 2048 | dimensions | 4× the model dimension; ReLU activation between two linear transforms |
| Total parameters (base) | 65 | million | Encoder-decoder base model; 'big' model: 213M parameters |
| BLEU score (WMT EN-DE) | 28.4 | BLEU | Big model's result; surpassed all previously reported models, including ensembles |
| Training hardware | 8 × NVIDIA P100 | GPUs | Base model trained for 100,000 steps (~12 hours) |
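The head and feed-forward dimensions in the table follow directly from d_model; a minimal Python check of the arithmetic (values restate the table above):

```python
# Dimension arithmetic for the transformer base configuration (Vaswani et al., 2017).
d_model = 512
num_heads = 8
d_ff = 2048

d_k = d_model // num_heads  # per-head query/key dimension
d_v = d_model // num_heads  # per-head value dimension

assert d_k == 64 and d_v == 64   # each head operates on 64 dimensions
assert d_ff == 4 * d_model       # feed-forward inner dimension is 4x d_model
```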
The transformer architecture, introduced by Vaswani et al. in “Attention Is All You Need” (NeurIPS 2017), replaced recurrent and convolutional sequence models with a purely attention-based design. This architectural decision enabled greater parallelism during training and more effective modeling of long-range dependencies in text.
Core Architecture
The transformer consists of an encoder that maps an input sequence (x₁, …, x_n) to a continuous representation z = (z₁, …, z_n), and an autoregressive decoder that generates an output sequence (y₁, …, y_m) one element at a time, each step consuming the previously generated elements.
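The encode-then-generate loop described above can be sketched schematically; `encode` and `decode_step` below are hypothetical placeholders standing in for the encoder stack and one decoder forward pass, not functions from the paper:

```python
# Schematic autoregressive generation with an encoder-decoder transformer.
# `encode` maps source tokens to the memory z; `decode_step` returns the
# next token id given the memory and all previously generated tokens.
def generate(encode, decode_step, src_tokens, bos_id, eos_id, max_len=50):
    memory = encode(src_tokens)             # z = (z1, ..., zn)
    out = [bos_id]
    for _ in range(max_len):
        next_id = decode_step(memory, out)  # consumes previously generated tokens
        out.append(next_id)
        if next_id == eos_id:               # stop once end-of-sequence is emitted
            break
    return out
```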
| Component | Base Model | Big Model |
|---|---|---|
| Encoder layers (N) | 6 | 6 |
| Decoder layers (N) | 6 | 6 |
| d_model | 512 | 1024 |
| d_ff | 2048 | 4096 |
| Attention heads (h) | 8 | 16 |
| d_k = d_v | 64 | 64 |
| Dropout | 0.1 | 0.3 |
| Total parameters | ~65M | ~213M |
| WMT EN-DE BLEU | 27.3 | 28.4 |
Encoder Stack
Each encoder layer contains two sublayers:
- Multi-head self-attention — each token attends to all other tokens in the input
- Position-wise feed-forward network — two linear transformations with ReLU: FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
Each sublayer uses a residual connection and layer normalization: LayerNorm(x + Sublayer(x)).
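As a sketch of these two building blocks, here is a minimal NumPy implementation of the position-wise FFN and the residual-plus-LayerNorm wrapper; the weight shapes follow the base configuration, and the random initialization is purely illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the feature dimension (d_model).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at each position.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

d_model, d_ff, seq_len = 512, 2048, 10
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

# Sublayer with residual connection and layer normalization:
out = layer_norm(x + ffn(x, W1, b1, W2, b2))
assert out.shape == (seq_len, d_model)  # output keeps the d_model width
```

Keeping d_model constant through every sublayer is what makes the residual addition `x + Sublayer(x)` well-defined without any projection.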
Decoder Stack
Each decoder layer contains three sublayers:
- Masked multi-head self-attention — attends to previous output positions; masking prevents attending to future positions
- Multi-head cross-attention — attends over the encoder output (memory)
- Position-wise feed-forward network — same design as encoder
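The masking in the first decoder sublayer can be illustrated with a small NumPy sketch; the uniform attention scores are assumed purely for demonstration:

```python
import numpy as np

def causal_mask(n):
    # True where position i may attend to position j (i.e. j <= i).
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    # Disallowed (future) positions get -inf, so softmax assigns them zero weight.
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.zeros((n, n))  # uniform scores, for illustration only
weights = masked_softmax(scores, causal_mask(n))

# Row i spreads attention uniformly over positions 0..i:
assert np.allclose(weights[0], [1, 0, 0, 0])          # first token sees only itself
assert np.allclose(weights[-1], [0.25, 0.25, 0.25, 0.25])
```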
Scaling the Architecture
The original 512/6/8 configuration was chosen as a practical baseline. Subsequent work demonstrated that the architecture scales effectively:
- Increasing d_model, depth, and heads with more data and compute consistently improves performance
- BERT (2018) used an encoder-only transformer, while GPT used a decoder-only design
- Modern large language models are predominantly decoder-only transformers with the same core design, scaled to billions of parameters
Related Pages
See attention-is-all-you-need for the full paper summary, multi-head-attention for attention mechanism details, and scaling-laws for how performance scales with model size.
Sources
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
- Alammar, J. — The Illustrated Transformer (2018)
- Devlin et al. (2019) — BERT: Pre-training of Deep Bidirectional Transformers. NAACL 2019
Frequently Asked Questions
What are the key dimensions of the original transformer model?
The original transformer (Vaswani et al., 2017) uses d_model=512, 8 attention heads (each operating on d_k=d_v=64 dimensions), d_ff=2048 in the feed-forward layers, 6 encoder layers, and 6 decoder layers, totaling approximately 65 million parameters for the base model.
Why is d_model divided by the number of heads in multi-head attention?
Dividing d_model by the number of heads (h) ensures that each attention head operates on d_k = d_model/h dimensions, so the total computation is equivalent to a single full-dimensional attention head. This allows the model to attend to information from different representation subspaces at different positions without increasing the computational cost.
What was the significance of the transformer over previous sequence models?
Prior models like LSTMs and GRUs processed tokens sequentially, making parallelization during training difficult and limiting long-range dependency capture due to vanishing gradients. The transformer's self-attention mechanism attends to all positions simultaneously in O(1) sequential operations (vs O(n) for recurrent models), enabling much faster training and better long-range dependency modeling.