Sinusoidal Positional Encoding: Wavelengths, Extrapolation, and Learned vs Fixed Comparison

Category: architecture Updated: 2026-02-27

Sinusoidal positional encodings define PE(pos,2i)=sin(pos/10000^{2i/d_model}), with wavelengths from 2π to 10000·2π; Vaswani et al. (2017) found learned and fixed encodings achieve equivalent BLEU on WMT EN-DE.

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Minimum wavelength (dimension 0) | 2π ≈ 6.28 | tokens | PE(pos, 0) = sin(pos); completes one full cycle every 2π positions |
| Maximum wavelength (dimension d_model − 2) | 10000 · 2π ≈ 62,832 | tokens | Lowest-frequency dimension; nearly static over typical sequence lengths |
| Wavelength at dimension 2i | 2π · 10000^{2i/d_model} | tokens | Geometric progression; each pair of dimensions is 10000^{2/d_model} ≈ 1.037× longer |
| Number of encoding dimensions | 512 | dimensions | 256 sine + 256 cosine dimensions; same as d_model, so encodings add directly to embeddings |
| Learned vs sinusoidal BLEU (WMT EN-DE) | 25.8 vs 25.8 | BLEU | No significant difference; Vaswani et al. Table 3, row (E) |
| Learned encoding max position | ≤ training length | positions | Cannot extrapolate beyond seen positions; sinusoidal encodings extrapolate by construction |
| ALiBi (linear bias) extrapolation gain | +2.0 | perplexity reduction | Press et al. (2022) show ALiBi extrapolates to longer contexts at inference without perplexity degradation |

Positional encoding solves a fundamental limitation of self-attention: the operation is permutation-equivariant by design, treating input tokens as a set rather than a sequence. Without explicit position information, a transformer cannot distinguish “the cat sat on the mat” from any permutation of those tokens.

The Sinusoidal Encoding Formula

For a token at position pos and encoding dimension index i (ranging from 0 to d_model/2 − 1):

PE(pos, 2i) = sin(pos / 10000^{2i/d_model})

PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})

These encodings are added elementwise to the token embeddings before the first layer. They have the same d_model=512 dimensions as the token embeddings, so no projection is needed.
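The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation; the function name and the vectorized layout are choices made here, not from the paper.

```python
import numpy as np

def sinusoidal_encoding(n_positions: int, d_model: int = 512) -> np.ndarray:
    """Return an (n_positions, d_model) sinusoidal positional encoding matrix."""
    positions = np.arange(n_positions)[:, None]            # shape (n_positions, 1)
    dim_pairs = np.arange(0, d_model, 2)                   # even indices 2i
    inv_freq = 1.0 / 10000 ** (dim_pairs / d_model)        # 1 / 10000^{2i/d_model}
    angles = positions * inv_freq                          # (n_positions, d_model/2)
    pe = np.empty((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)   # PE(pos, 2i+1)
    return pe

pe = sinusoidal_encoding(128)
# At pos = 0, every sine channel is 0 and every cosine channel is 1.
```

In practice this matrix is simply added to the token embedding matrix of the same shape; no projection or concatenation is involved.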

Wavelength Table

The denominator 10000^{2i/512} defines the wavelength (period of one full oscillation) for each dimension pair:

| Dimension pair 2i | Wavelength (tokens) | Interpretation |
|---|---|---|
| 0 | 2π ≈ 6.28 | Fast: resolves position within ≈6 tokens |
| 64 | 2π · 10000^{0.125} ≈ 19.9 | Medium-fast |
| 128 | 2π · 10000^{0.25} ≈ 62.8 | Medium |
| 256 | 2π · 10000^{0.5} ≈ 628 | Slow |
| 510 | 2π · 10000^{510/512} ≈ 60,600 | Very slow: distinguishes positions across a full document |

The wavelengths form a geometric progression with ratio 10000^{2/512} ≈ 1.037 per dimension pair.
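The progression is easy to verify numerically. A short NumPy check (variable names are illustrative):

```python
import numpy as np

d_model = 512
dim_pairs = np.arange(0, d_model, 2)                   # even indices 2i
wavelengths = 2 * np.pi * 10000 ** (dim_pairs / d_model)

# Adjacent dimension pairs differ by a constant factor: geometric progression.
ratio = wavelengths[1] / wavelengths[0]                # 10000^{2/512} ≈ 1.037

print(f"2i=0:   {wavelengths[0]:.2f} tokens")          # ≈ 6.28
print(f"2i=128: {wavelengths[64]:.2f} tokens")         # ≈ 62.83
print(f"ratio per pair: {ratio:.4f}")
```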

Learned vs Fixed Encodings

Vaswani et al. tested learned absolute positional embeddings (a trainable embedding matrix indexed by position) against the fixed sinusoidal formula:

| Encoding type | WMT EN-DE BLEU | Extrapolates to longer seqs | Parameters added |
|---|---|---|---|
| Sinusoidal (fixed) | 25.8 | Yes | 0 |
| Learned absolute | 25.8 | No | n_pos × d_model |

Performance is identical on standard benchmarks, but the sinusoidal encoding is parameter-free and extrapolates to positions unseen during training.
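The extrapolation difference comes down to lookup table versus closed-form formula. A toy NumPy sketch (random weights stand in for a trained embedding matrix; names are illustrative):

```python
import numpy as np

train_len, d_model = 512, 512
rng = np.random.default_rng(0)

# Learned absolute encoding: a trainable lookup table with one row per
# position seen during training (random stand-ins for trained weights).
learned_table = rng.normal(size=(train_len, d_model))

def learned_pe(pos: int) -> np.ndarray:
    return learned_table[pos]          # raises IndexError for pos >= train_len

def sinusoidal_pe(pos: int) -> np.ndarray:
    dims = np.arange(0, d_model, 2)
    angle = pos / 10000 ** (dims / d_model)
    out = np.empty(d_model)
    out[0::2] = np.sin(angle)
    out[1::2] = np.cos(angle)
    return out

vec = sinusoidal_pe(600)   # valid: the formula accepts any position
try:
    learned_pe(600)        # fails: the table has no row for position 600
except IndexError:
    pass
```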

Relative Position Information

A key property of the sinusoidal design is that for any fixed offset k, the encoding PE(pos+k) can be expressed as a linear function of PE(pos), because:

sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

cos(a+b) = cos(a)cos(b) − sin(a)sin(b)

Applied with a = pos·ω_i and b = k·ω_i, these identities show that each (sin, cos) pair at position pos+k is a rotation of the corresponding pair at position pos, by an angle that depends only on k.

This means the model can, in principle, learn relative position relationships by learning linear combinations of the encoding dimensions — though modern alternatives such as Rotary Position Embedding (RoPE, Su et al. 2024) and ALiBi (Press et al. 2022) encode relative positions more explicitly and often improve extrapolation performance.
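The linearity property can be checked directly: the offset-k map is a block-diagonal matrix of 2×2 rotations, one per (sin, cos) pair. A small NumPy verification (toy d_model, names illustrative):

```python
import numpy as np

d_model, k = 8, 5   # small toy dimension; k is the fixed offset
freqs = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)

def pe(pos: float) -> np.ndarray:
    out = np.empty(d_model)
    out[0::2] = np.sin(pos * freqs)
    out[1::2] = np.cos(pos * freqs)
    return out

# Block-diagonal matrix M_k: one 2x2 rotation per (sin, cos) pair,
# with angle k * omega_i. It depends only on the offset k, never on pos.
M = np.zeros((d_model, d_model))
for i, w in enumerate(freqs):
    c, s = np.cos(k * w), np.sin(k * w)
    M[2*i:2*i + 2, 2*i:2*i + 2] = [[c, s], [-s, c]]

for pos in (0, 7, 100):
    assert np.allclose(M @ pe(pos), pe(pos + k))
```

Because M depends only on k, an attention head can in principle realize "attend k tokens back" as a single fixed linear map over the encoding dimensions.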

Positional encodings interact directly with the attention weights. See self-attention-mechanism for how Q·Kᵀ scores are computed after encodings are added, and transformer-architecture for where encoding fits in the full pipeline.


Frequently Asked Questions

Why do sinusoidal encodings allow extrapolation to longer sequences?

The sinusoidal functions are defined for all real-valued positions. A model trained on sequences of length 512 can compute PE(600, i) using the same formula, producing a valid encoding. Learned positional embeddings are a lookup table indexed by position index — they have no entries for positions beyond the training length.

Why use both sine and cosine functions?

Using sin for even dimensions and cos for odd dimensions means that PE(pos+k) can always be expressed as a linear function of PE(pos) for any fixed offset k. This property lets the model learn to attend by relative offset. With only sine functions this decomposition breaks down.

How is the base 10000 chosen in the positional encoding formula?

The base 10000 was chosen empirically by Vaswani et al. to produce wavelengths that span from 2π (fast-varying, captures local position) to ~62,832 tokens (slow-varying, distinguishes very different positions). The geometric progression ensures roughly equal information across all frequency scales.
