Module 3 · Architecture

Positional Encoding

Attention doesn't know word order. Learn how sine and cosine waves inject position information into the model.

⏱️ 8 min 🎯 Lesson 8 of 11
Slide 1 of 5

The Order Problem

Attention is permutation invariant — it treats all tokens equally regardless of position. This means:

Without position info, these look identical to the model:

"Dog bites man"
vs
"Man bites dog"

❌ Very different meanings, but same set of tokens!

💡 We need to inject information about where each token appears in the sequence.
Slide 2 of 5

The Solution: Sinusoidal Encoding

The original Transformer uses sine and cosine functions at different frequencies to encode position:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where:

pos = position in the sequence (0, 1, 2, ...)
i = dimension index
d_model = embedding dimension (e.g., 512)
💡 Even dimensions get sin, odd dimensions get cos. Different frequencies create unique patterns for each position.
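The two formulas above can be sketched directly in Python — a minimal illustration (with a toy d_model of 8), not any particular library's implementation:

```python
import math

def positional_encoding(pos, d_model):
    """Return the PE vector for one position: sin on even dims, cos on odd dims."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        # Frequency shrinks as the dimension index i grows.
        freq = 1.0 / (10000 ** (2 * i / d_model))
        pe[2 * i] = math.sin(pos * freq)      # even dimension 2i:   sin
        pe[2 * i + 1] = math.cos(pos * freq)  # odd dimension 2i+1:  cos
    return pe

print(positional_encoding(0, 8))  # → [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Note that position 0 always yields alternating 0s and 1s, since sin(0) = 0 and cos(0) = 1 at every frequency.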
Slide 3 of 5

Why Sine Waves?

Sine waves have beautiful properties for encoding position:

1
Unique per position: Each position gets a unique pattern of sine/cosine values
2
Bounded: Values always between -1 and +1, same scale as embeddings
3
Relative positions: PE(pos+k) can be expressed as a linear function of PE(pos) — model can learn relative distances
4
Extrapolation: Can be computed for any sequence length, even longer than those seen during training (though in practice quality degrades well beyond the training length)
💡 Think of it like a multi-frequency radio signal. Low frequencies encode rough position (which sentence half), high frequencies encode fine position (which specific word).
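Property 3 is just the angle-addition identities: shifting a position by k rotates each (sin, cos) pair by a fixed angle k·ω. A quick numerical check (toy values for pos, k, and the dimension index):

```python
import math

d_model, pos, k, i = 8, 5, 3, 1          # illustrative values, not special
omega = 1.0 / (10000 ** (2 * i / d_model))

# Rotating the (sin, cos) pair at `pos` by angle k*omega...
s, c = math.sin(pos * omega), math.cos(pos * omega)
s_shift = s * math.cos(k * omega) + c * math.sin(k * omega)
c_shift = c * math.cos(k * omega) - s * math.sin(k * omega)

# ...lands exactly on the pair at `pos + k`.
assert abs(s_shift - math.sin((pos + k) * omega)) < 1e-12
assert abs(c_shift - math.cos((pos + k) * omega)) < 1e-12
```

Because this rotation matrix depends only on the offset k (not on pos), the model can learn "attend 2 tokens back" as a single linear transformation.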
Slide 4 of 5

Adding to Embeddings

The positional encoding is simply added to the token embedding:

input = token_embedding + positional_encoding
1
Token "cat" → embedding vector [0.2, -0.5, 0.8, ...]
2
Position 3 → PE vector [sin(3/1), cos(3/1), sin(3/100), ...]
3
Input = embedding + PE = [0.2+sin(3/1), -0.5+cos(3/1), ...]

The word "cat" at position 3 now has a different input than "cat" at position 7. The attention mechanism can use this difference to understand order!
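The three steps above can be sketched end to end. The embedding values and the 4-dimensional size are made up for illustration:

```python
import math

def pe_vector(pos, d_model):
    # sin on even dims, cos on odd dims, as in the formulas on the previous slide
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        freq = 1.0 / (10000 ** (2 * i / d_model))
        pe[2 * i], pe[2 * i + 1] = math.sin(pos * freq), math.cos(pos * freq)
    return pe

cat_embedding = [0.2, -0.5, 0.8, 0.1]  # hypothetical 4-dim embedding for "cat"

# Same token, two different positions → two different model inputs.
input_at_3 = [e + p for e, p in zip(cat_embedding, pe_vector(3, 4))]
input_at_7 = [e + p for e, p in zip(cat_embedding, pe_vector(7, 4))]
assert input_at_3 != input_at_7
```

Addition (rather than concatenation) keeps the input at the same dimension as the embedding, so the rest of the network is unchanged.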

Slide 5 of 5

Interactive: Positional Encoding Visualization

Each colored line shows a different dimension of the encoding across 50 positions. Notice how different frequencies create distinct patterns:

Low-frequency dimensions (dim 0, 1) change slowly — encoding coarse position. High-frequency dimensions change rapidly — encoding fine position.

✅ Many modern LLMs (e.g., Llama, GPT-NeoX) use Rotary Position Embedding (RoPE) — an improved scheme that rotates query/key vectors instead of adding a PE vector. But the underlying sinusoidal principle is the same!
🎵

The Musical Chord Analogy

Each position in the sequence is like a unique musical chord — a combination of different frequencies played together. Just as you can identify a chord by its sound, the model can identify a position by its pattern of sine/cosine values. No two positions have exactly the same "chord".

Key Concepts

📍
Positional Encoding
A vector added to each token embedding to encode its position in the sequence.
📈
Sinusoidal
Uses sin/cos at multiple frequencies — each position gets a unique fingerprint.
➕
Addition
PE is added (not concatenated) to the token embedding, keeping the same dimension.
🔄
RoPE
Rotary Position Embedding — used by Llama, GPT-NeoX, and other recent models.

Quick Check

Why does a Transformer need Positional Encoding?

A
Attention has no notion of order, so position must be explicitly added to embeddings
B
To increase the embedding dimension
C
To make training faster
D
To encode word frequency in the training data