Transformer Architecture: Encoder-Decoder Design and Dimensions

Category: architecture · Updated: 2026-02-27

The original transformer base model uses 6 encoder and 6 decoder layers, d_model=512, 8 attention heads, and roughly 65M parameters; trained on WMT 2014 English-German, it reaches 27.3 BLEU, while the larger "big" variant reaches 28.4 BLEU (Vaswani et al., 2017).

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Encoder layers | 6 | layers | Each layer has multi-head self-attention + feed-forward sublayers |
| Decoder layers | 6 | layers | Each layer has self-attention, cross-attention, and feed-forward sublayers |
| d_model (embedding dimension) | 512 | dimensions | Uniform across all sublayers for easy residual connections |
| Number of attention heads | 8 | heads | Each head operates on d_k = d_v = d_model/h = 64 dimensions |
| d_ff (feed-forward inner dimension) | 2048 | dimensions | 4× the model dimension; ReLU activation between two linear transforms |
| Total parameters (base) | 65 | million | Encoder-decoder base model; 'big' model: 213M parameters |
| BLEU score (WMT EN-DE, big model) | 28.4 | BLEU | Best result in paper; surpassed all prior ensemble models |
| Training hardware | 8 | NVIDIA P100 GPUs | Base model trained for 100,000 steps (~12 hours) |

The transformer architecture, introduced by Vaswani et al. in “Attention Is All You Need” (NeurIPS 2017), replaced recurrent and convolutional sequence models with a purely attention-based design. This architectural decision enabled greater parallelism during training and more effective modeling of long-range dependencies in text.

Core Architecture

The transformer consists of an encoder that maps an input sequence (x₁, …, x_n) to a continuous representation z = (z₁, …, z_n), and an autoregressive decoder that generates an output sequence (y₁, …, y_m) one element at a time, each step consuming the previously generated elements.
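A minimal sketch of that autoregressive loop, with hypothetical encode and decode_step callables standing in for the real encoder and decoder stacks (these names are illustrative, not from the paper):

```python
# Sketch of encoder-decoder inference: encode the source once, then generate
# one token at a time, feeding previously generated tokens back to the decoder.
def greedy_translate(src_tokens, encode, decode_step, bos_id, eos_id, max_len=128):
    memory = encode(src_tokens)                # z = (z1, ..., zn): encoder output
    output = [bos_id]                          # decoder starts from a begin-of-sequence token
    for _ in range(max_len):
        next_id = decode_step(output, memory)  # consumes all previously generated tokens
        output.append(next_id)
        if next_id == eos_id:                  # stop once end-of-sequence is produced
            break
    return output[1:]
```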

| Component | Base Model | Big Model |
|---|---|---|
| Encoder layers (N) | 6 | 6 |
| Decoder layers (N) | 6 | 6 |
| d_model | 512 | 1024 |
| d_ff | 2048 | 4096 |
| Attention heads (h) | 8 | 16 |
| d_k = d_v | 64 | 64 |
| Dropout | 0.1 | 0.3 |
| Total parameters | ~65M | ~213M |
| WMT EN-DE BLEU | 27.3 | 28.4 |
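
A small sketch that records the two configurations from the table as plain Python dicts and checks the per-head dimension (the parameter counts are the approximate totals reported in the paper, not computed here):

```python
# Base and big configurations from "Attention Is All You Need" (values from the table above).
BASE = dict(n_layers=6, d_model=512,  d_ff=2048, n_heads=8,  dropout=0.1, params="~65M")
BIG  = dict(n_layers=6, d_model=1024, d_ff=4096, n_heads=16, dropout=0.3, params="~213M")

for cfg in (BASE, BIG):
    assert cfg["d_model"] % cfg["n_heads"] == 0      # d_k = d_v = d_model / h must divide evenly
    print(cfg["d_model"] // cfg["n_heads"])          # 64 per-head dimensions in both cases
```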

Encoder Stack

Each encoder layer contains two sublayers:

  1. Multi-head self-attention — each token attends to all other tokens in the input
  2. Position-wise feed-forward network — two linear transformations with ReLU: FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

Each sublayer uses a residual connection and layer normalization: LayerNorm(x + Sublayer(x)).
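A minimal PyTorch sketch of one such encoder layer, using nn.MultiheadAttention as a stand-in for the paper's multi-head attention and the post-norm arrangement described above; dimensions default to the base model:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention + position-wise FFN,
    each wrapped as LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(                       # FFN(x) = max(0, xW1 + b1)W2 + b2
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                               # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)           # every token attends to all tokens
        x = self.norm1(x + self.drop(attn_out))         # residual connection + LayerNorm
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x
```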

Decoder Stack

Each decoder layer contains three sublayers:

  1. Masked multi-head self-attention — attends to previous output positions; masking prevents attending to future positions
  2. Multi-head cross-attention — attends over the encoder output (memory)
  3. Position-wise feed-forward network — same design as encoder
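
The masking in sublayer 1 is commonly implemented as an additive upper-triangular mask of -inf values applied to the attention logits before the softmax; a minimal sketch:

```python
import torch

def causal_mask(seq_len):
    """Upper-triangular mask: position i may attend only to positions <= i.
    The -inf entries are added to attention logits, zeroing those weights after softmax."""
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

print(causal_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0.,   0., -inf, -inf],
#         [0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.]])
```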

Scaling the Architecture

The original 512/6/8 configuration was chosen as a practical baseline. Subsequent work demonstrated that the architecture scales effectively:

  • Increasing d_model, depth, and heads with more data and compute consistently improves performance
  • BERT (2018) used an encoder-only transformer; GPT (2018) used a decoder-only one
  • Modern large language models are predominantly decoder-only transformers with the same core design, scaled to billions of parameters

See attention-is-all-you-need for the full paper summary, multi-head-attention for attention mechanism details, and scaling-laws for how performance scales with model size.

Frequently Asked Questions

What are the key dimensions of the original transformer model?

The original transformer (Vaswani et al., 2017) uses d_model=512, 8 attention heads (each operating on d_k=d_v=64 dimensions), d_ff=2048 in the feed-forward layers, 6 encoder layers, and 6 decoder layers, totaling approximately 65 million parameters for the base model.

Why is d_model divided by the number of heads in multi-head attention?

Dividing d_model by the number of heads (h) ensures that each attention head operates on d_k = d_model/h dimensions, so the total computation is equivalent to a single full-dimensional attention head. This allows the model to attend to information from different representation subspaces at different positions without increasing the computational cost.
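
A small illustrative sketch of that split (the batch size of 2 and sequence length of 10 are arbitrary assumptions, not values from the paper):

```python
import torch

d_model, h = 512, 8
d_k = d_model // h                                # 64 dimensions per head
x = torch.randn(2, 10, d_model)                   # (batch, seq_len, d_model)

# Split the model dimension into h heads of d_k dimensions each:
heads = x.view(2, 10, h, d_k).transpose(1, 2)     # (batch, h, seq_len, d_k)
print(heads.shape)                                # torch.Size([2, 8, 10, 64])
```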

What was the significance of the transformer over previous sequence models?

Prior models like LSTMs and GRUs processed tokens sequentially, making parallelization during training difficult and limiting long-range dependency capture due to vanishing gradients. The transformer's self-attention mechanism attends to all positions simultaneously in O(1) sequential operations (vs O(n) for recurrent models), enabling much faster training and better long-range dependency modeling.
