Sinusoidal Positional Encoding: Wavelengths, Extrapolation, and Learned vs Fixed Comparison

Category: architecture Updated: 2026-02-27

Sinusoidal positional encodings define PE(pos,2i)=sin(pos/10000^{2i/d_model}), with wavelengths from 2π to 10000·2π; Vaswani et al. (2017) found learned and fixed encodings achieve equivalent BLEU on WMT EN-DE.

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Minimum wavelength (dimension 0) | 2π ≈ 6.28 | tokens | PE(pos, 0) = sin(pos); completes one full cycle every 2π positions |
| Maximum wavelength (dimension d_model − 2) | 10000 · 2π ≈ 62,832 | tokens | Lowest-frequency dimension; nearly static over typical sequence lengths |
| Wavelength at dimension 2i | 2π · 10000^{2i/d_model} | tokens | Geometric progression; each pair of dimensions is 10000^{2/d_model} ≈ 1.037× longer |
| Number of encoding dimensions | 512 | dimensions | 256 sine + 256 cosine dimensions; same as d_model, so encodings add directly to embeddings |
| Learned vs sinusoidal BLEU (WMT EN-DE) | 25.8 vs 25.8 | BLEU | No significant difference; Vaswani et al. Table 3, row (E) |
| Learned encoding max position | ≤ training length | positions | Cannot extrapolate beyond seen positions; sinusoidal encodings extrapolate by construction |
| ALiBi (linear bias) extrapolation gain | +2.0 | perplexity reduction | Press et al. (2022) show ALiBi extrapolates to longer contexts at inference without perplexity degradation |

Positional encoding solves a fundamental limitation of self-attention: the operation is permutation-equivariant by design, treating input tokens as a set rather than a sequence. Without explicit position information, a transformer cannot distinguish “the cat sat on the mat” from any permutation of those tokens.

The Sinusoidal Encoding Formula

For a token at position pos and encoding dimension index i (ranging from 0 to d_model/2 − 1):

PE(pos, 2i) = sin(pos / 10000^{2i/d_model})

PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})

These encodings are added elementwise to the token embeddings before the first layer. They have the same d_model=512 dimensions as the token embeddings, so no projection is needed.
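The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation; the function name and the vectorized layout are choices made here, not from the paper.

```python
import numpy as np

def sinusoidal_encoding(n_positions: int, d_model: int = 512) -> np.ndarray:
    """Return an (n_positions, d_model) sinusoidal positional encoding matrix."""
    positions = np.arange(n_positions)[:, None]            # shape (n_positions, 1)
    dim_pairs = np.arange(0, d_model, 2)                   # even indices 2i
    inv_freq = 1.0 / 10000 ** (dim_pairs / d_model)        # 1 / 10000^{2i/d_model}
    angles = positions * inv_freq                          # (n_positions, d_model/2)
    pe = np.empty((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)   # PE(pos, 2i+1)
    return pe

pe = sinusoidal_encoding(128)
# At pos = 0, every sine channel is 0 and every cosine channel is 1.
```

In practice this matrix is simply added to the token embedding matrix of the same shape; no projection or concatenation is involved.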

Wavelength Table

The denominator 10000^{2i/512} defines the wavelength (period of one full oscillation) for each dimension pair:

| Dimension pair 2i | Wavelength (tokens) | Interpretation |
|---|---|---|
| 0 | 2π ≈ 6.28 | Fast: resolves position within ≈6 tokens |
| 64 | 2π · 10000^{0.125} ≈ 19.9 | Medium-fast |
| 128 | 2π · 10000^{0.25} ≈ 62.8 | Medium |
| 256 | 2π · 10000^{0.5} ≈ 628 | Slow |
| 510 | 2π · 10000^{510/512} ≈ 60,600 | Very slow: distinguishes positions across a full document |

The wavelengths form a geometric progression with ratio 10000^{2/512} ≈ 1.037 per dimension pair.
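The progression is easy to verify numerically. A short NumPy check (variable names are illustrative):

```python
import numpy as np

d_model = 512
dim_pairs = np.arange(0, d_model, 2)                   # even indices 2i
wavelengths = 2 * np.pi * 10000 ** (dim_pairs / d_model)

# Adjacent dimension pairs differ by a constant factor: geometric progression.
ratio = wavelengths[1] / wavelengths[0]                # 10000^{2/512} ≈ 1.037

print(f"2i=0:   {wavelengths[0]:.2f} tokens")          # ≈ 6.28
print(f"2i=128: {wavelengths[64]:.2f} tokens")         # ≈ 62.83
print(f"ratio per pair: {ratio:.4f}")
```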

Learned vs Fixed Encodings

Vaswani et al. tested learned absolute positional embeddings (a trainable embedding matrix indexed by position) against the fixed sinusoidal formula:

| Encoding type | WMT EN-DE BLEU | Extrapolates to longer seqs | Parameters added |
|---|---|---|---|
| Sinusoidal (fixed) | 25.8 | Yes | 0 |
| Learned absolute | 25.8 | No | n_pos × d_model |

Performance is identical on standard benchmarks, but the sinusoidal encoding is parameter-free and extrapolates to positions unseen during training.
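The extrapolation difference comes down to lookup table versus closed-form formula. A toy NumPy sketch (random weights stand in for a trained embedding matrix; names are illustrative):

```python
import numpy as np

train_len, d_model = 512, 512
rng = np.random.default_rng(0)

# Learned absolute encoding: a trainable lookup table with one row per
# position seen during training (random stand-ins for trained weights).
learned_table = rng.normal(size=(train_len, d_model))

def learned_pe(pos: int) -> np.ndarray:
    return learned_table[pos]          # raises IndexError for pos >= train_len

def sinusoidal_pe(pos: int) -> np.ndarray:
    dims = np.arange(0, d_model, 2)
    angle = pos / 10000 ** (dims / d_model)
    out = np.empty(d_model)
    out[0::2] = np.sin(angle)
    out[1::2] = np.cos(angle)
    return out

vec = sinusoidal_pe(600)   # valid: the formula accepts any position
try:
    learned_pe(600)        # fails: the table has no row for position 600
except IndexError:
    pass
```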

Relative Position Information

A key property of the sinusoidal design is that for any fixed offset k, the encoding PE(pos+k) can be expressed as a linear function of PE(pos), because:

sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

cos(a+b) = cos(a)cos(b) − sin(a)sin(b)

Applied with a = pos·ω_i and b = k·ω_i, these identities show that each (sin, cos) pair at position pos+k is a rotation of the corresponding pair at position pos, by an angle that depends only on k.

This means the model can, in principle, learn relative position relationships by learning linear combinations of the encoding dimensions — though modern alternatives such as Rotary Position Embedding (RoPE, Su et al. 2024) and ALiBi (Press et al. 2022) encode relative positions more explicitly and often improve extrapolation performance.
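The linearity property can be checked directly: the offset-k map is a block-diagonal matrix of 2×2 rotations, one per (sin, cos) pair. A small NumPy verification (toy d_model, names illustrative):

```python
import numpy as np

d_model, k = 8, 5   # small toy dimension; k is the fixed offset
freqs = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)

def pe(pos: float) -> np.ndarray:
    out = np.empty(d_model)
    out[0::2] = np.sin(pos * freqs)
    out[1::2] = np.cos(pos * freqs)
    return out

# Block-diagonal matrix M_k: one 2x2 rotation per (sin, cos) pair,
# with angle k * omega_i. It depends only on the offset k, never on pos.
M = np.zeros((d_model, d_model))
for i, w in enumerate(freqs):
    c, s = np.cos(k * w), np.sin(k * w)
    M[2*i:2*i + 2, 2*i:2*i + 2] = [[c, s], [-s, c]]

for pos in (0, 7, 100):
    assert np.allclose(M @ pe(pos), pe(pos + k))
```

Because M depends only on k, an attention head can in principle realize "attend k tokens back" as a single fixed linear map over the encoding dimensions.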

Positional encodings interact directly with the attention weights. See self-attention-mechanism for how Q·Kᵀ scores are computed after encodings are added, and transformer-architecture for where encoding fits in the full pipeline.


Frequently Asked Questions

Why do sinusoidal encodings allow extrapolation to longer sequences?

The sinusoidal functions are defined for all real-valued positions. A model trained on sequences of length 512 can compute PE(600, i) using the same formula, producing a valid encoding. Learned positional embeddings are a lookup table indexed by position index — they have no entries for positions beyond the training length.

Why use both sine and cosine functions?

Using sin for even dimensions and cos for odd dimensions means that PE(pos+k) can always be expressed as a linear function of PE(pos) for any fixed offset k. This property lets the model learn to attend by relative offset. With only sine functions this decomposition breaks down.

How is the base 10000 chosen in the positional encoding formula?

The base 10000 was chosen empirically by Vaswani et al. to produce wavelengths that span from 2π (fast-varying, captures local position) to ~62,832 tokens (slow-varying, distinguishes very different positions). The geometric progression ensures roughly equal information across all frequency scales.
