Transformer Architecture: Encoder-Decoder Design and Dimensions
The original transformer uses 6 encoder and 6 decoder layers, d_model=512, 8 attention heads, and roughly 65M parameters in its base configuration; trained on WMT 2014 English-German, the larger "big" variant achieved 28.4 BLEU (Vaswani et al., 2017).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Encoder layers | 6 | layers | Each layer has multi-head self-attention + feed-forward sublayers |
| Decoder layers | 6 | layers | Each layer has self-attention, cross-attention, and feed-forward sublayers |
| d_model (embedding dimension) | 512 | dimensions | Uniform across all sublayers for easy residual connections |
| Number of attention heads | 8 | heads | Each head operates on d_k = d_v = d_model/h = 64 dimensions |
| d_ff (feed-forward inner dimension) | 2048 | dimensions | 4× the model dimension; ReLU activation between two linear transforms |
| Total parameters (base) | 65 | million | Encoder-decoder base model; 'big' model: 213M parameters |
| BLEU score (WMT EN-DE) | 28.4 | BLEU | Big model's result; surpassed all previously reported models, including ensembles |
| Training hardware | 8 × NVIDIA P100 | GPUs | Base model trained for 100,000 steps (~12 hours) |
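The head and feed-forward dimensions in the table follow directly from d_model; a minimal Python check of the arithmetic (values restate the table above):

```python
# Dimension arithmetic for the transformer base configuration (Vaswani et al., 2017).
d_model = 512
num_heads = 8
d_ff = 2048

d_k = d_model // num_heads  # per-head query/key dimension
d_v = d_model // num_heads  # per-head value dimension

assert d_k == 64 and d_v == 64   # each head operates on 64 dimensions
assert d_ff == 4 * d_model       # feed-forward inner dimension is 4x d_model
```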
The transformer architecture, introduced by Vaswani et al. in “Attention Is All You Need” (NeurIPS 2017), replaced recurrent and convolutional sequence models with a purely attention-based design. This architectural decision enabled greater parallelism during training and more effective modeling of long-range dependencies in text.
Core Architecture
The transformer consists of an encoder that maps an input sequence (x₁, …, x_n) to a continuous representation z = (z₁, …, z_n), and an autoregressive decoder that generates an output sequence (y₁, …, y_m) one element at a time, each step consuming the previously generated elements.
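The encode-then-generate loop described above can be sketched schematically; `encode` and `decode_step` below are hypothetical placeholders standing in for the encoder stack and one decoder forward pass, not functions from the paper:

```python
# Schematic autoregressive generation with an encoder-decoder transformer.
# `encode` maps source tokens to the memory z; `decode_step` returns the
# next token id given the memory and all previously generated tokens.
def generate(encode, decode_step, src_tokens, bos_id, eos_id, max_len=50):
    memory = encode(src_tokens)             # z = (z1, ..., zn)
    out = [bos_id]
    for _ in range(max_len):
        next_id = decode_step(memory, out)  # consumes previously generated tokens
        out.append(next_id)
        if next_id == eos_id:               # stop once end-of-sequence is emitted
            break
    return out
```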
| Component | Base Model | Big Model |
|---|---|---|
| Encoder layers (N) | 6 | 6 |
| Decoder layers (N) | 6 | 6 |
| d_model | 512 | 1024 |
| d_ff | 2048 | 4096 |
| Attention heads (h) | 8 | 16 |
| d_k = d_v | 64 | 64 |
| Dropout | 0.1 | 0.3 |
| Total parameters | ~65M | ~213M |
| WMT EN-DE BLEU | 27.3 | 28.4 |
Encoder Stack
Each encoder layer contains two sublayers:
- Multi-head self-attention — each token attends to all other tokens in the input
- Position-wise feed-forward network — two linear transformations with ReLU: FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
Each sublayer uses a residual connection and layer normalization: LayerNorm(x + Sublayer(x)).
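As a sketch of these two building blocks, here is a minimal NumPy implementation of the position-wise FFN and the residual-plus-LayerNorm wrapper; the weight shapes follow the base configuration, and the random initialization is purely illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the feature dimension (d_model).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at each position.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

d_model, d_ff, seq_len = 512, 2048, 10
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

# Sublayer with residual connection and layer normalization:
out = layer_norm(x + ffn(x, W1, b1, W2, b2))
assert out.shape == (seq_len, d_model)  # output keeps the d_model width
```

Keeping d_model constant through every sublayer is what makes the residual addition `x + Sublayer(x)` well-defined without any projection.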
Decoder Stack
Each decoder layer contains three sublayers:
- Masked multi-head self-attention — attends to previous output positions; masking prevents attending to future positions
- Multi-head cross-attention — attends over the encoder output (memory)
- Position-wise feed-forward network — same design as encoder
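The masking in the first decoder sublayer can be illustrated with a small NumPy sketch; the uniform attention scores are assumed purely for demonstration:

```python
import numpy as np

def causal_mask(n):
    # True where position i may attend to position j (i.e. j <= i).
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    # Disallowed (future) positions get -inf, so softmax assigns them zero weight.
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.zeros((n, n))  # uniform scores, for illustration only
weights = masked_softmax(scores, causal_mask(n))

# Row i spreads attention uniformly over positions 0..i:
assert np.allclose(weights[0], [1, 0, 0, 0])          # first token sees only itself
assert np.allclose(weights[-1], [0.25, 0.25, 0.25, 0.25])
```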
Scaling the Architecture
The original 512/6/8 configuration was chosen as a practical baseline. Subsequent work demonstrated that the architecture scales effectively:
- Increasing d_model, depth, and heads with more data and compute consistently improves performance
- BERT (2018) used an encoder-only transformer, while GPT used a decoder-only design
- Modern large language models are predominantly decoder-only transformers with the same core design, scaled to billions of parameters
Related Pages
See attention-is-all-you-need for the full paper summary, multi-head-attention for attention mechanism details, and scaling-laws for how performance scales with model size.
Sources
- Vaswani et al. (2017) — Attention Is All You Need. NeurIPS 2017
- Alammar, J. — The Illustrated Transformer (2018)
- Devlin et al. (2019) — BERT: Pre-training of Deep Bidirectional Transformers. NAACL 2019
Frequently Asked Questions
What are the key dimensions of the original transformer model?
The original transformer (Vaswani et al., 2017) uses d_model=512, 8 attention heads (each operating on d_k=d_v=64 dimensions), d_ff=2048 in the feed-forward layers, 6 encoder layers, and 6 decoder layers, totaling approximately 65 million parameters for the base model.
Why is d_model divided by the number of heads in multi-head attention?
Dividing d_model by the number of heads (h) ensures that each attention head operates on d_k = d_model/h dimensions, so the total computation is equivalent to a single full-dimensional attention head. This allows the model to attend to information from different representation subspaces at different positions without increasing the computational cost.
What was the significance of the transformer over previous sequence models?
Prior models like LSTMs and GRUs processed tokens sequentially, making parallelization during training difficult and limiting long-range dependency capture due to vanishing gradients. The transformer's self-attention mechanism attends to all positions simultaneously in O(1) sequential operations (vs O(n) for recurrent models), enabling much faster training and better long-range dependency modeling.