Attention doesn't know word order. Learn how sine and cosine waves inject position information into the model.
⏱️ 8 min · 🎯 Lesson 8 of 11
Slide 1 of 5
The Order Problem
Attention is permutation invariant — it treats all tokens equally regardless of position. This means:
Without position info, these look identical to the model:
Dog bites man
vs
Man bites dog
❌ Very different meanings, but same set of tokens!
💡 We need to inject information about where each token appears in the sequence.
Slide 2 of 5
The Solution: Sinusoidal Encoding
The original Transformer uses sine and cosine functions at different frequencies to encode position:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Where:
• pos = position in the sequence (0, 1, 2, ...)
• i = dimension index
• d_model = embedding dimension (e.g., 512)
💡 Even dimensions get sin, odd dimensions get cos. Different frequencies create unique patterns for each position.
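The two formulas above can be computed for all positions at once. Here is a minimal NumPy sketch (the function name `sinusoidal_pe` is our own, not from any library):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model, base=10000.0):
    """Build the (seq_len, d_model) positional encoding matrix."""
    positions = np.arange(seq_len)[:, None]        # pos: 0, 1, 2, ...
    dims = np.arange(0, d_model, 2)[None, :]       # even dimension indices 2i
    angles = positions / base ** (dims / d_model)  # pos / 10000^(2i/d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions get sin
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions get cos
    return pe

pe = sinusoidal_pe(seq_len=50, d_model=512)
print(pe.shape)    # (50, 512)
print(pe[0, :4])   # position 0: sin(0)=0, cos(0)=1 → [0. 1. 0. 1.]
```

Each row of `pe` is the unique "fingerprint" for one position; each column pair oscillates at its own frequency.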
Slide 3 of 5
Why Sine Waves?
Sine waves have beautiful properties for encoding position:
1
Unique per position: Each position gets a unique pattern of sine/cosine values
2
Bounded: Values always between -1 and +1, same scale as embeddings
3
Relative positions: PE(pos+k) can be expressed as a linear function of PE(pos) — model can learn relative distances
4
Extrapolation: Works for any sequence length, even longer than seen during training
💡 Think of it like a multi-frequency radio signal. Low frequencies encode rough position (which sentence half), high frequencies encode fine position (which specific word).
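Property 3 above can be verified numerically: for any single frequency w, the pair (sin(w·pos), cos(w·pos)) at position pos+k is a fixed 2×2 rotation of the pair at position pos — the same matrix for every pos. A small sketch (the specific frequency and offset are arbitrary choices for illustration):

```python
import numpy as np

# Angle-addition identities:
#   sin(w(pos+k)) =  sin(w·pos)·cos(w·k) + cos(w·pos)·sin(w·k)
#   cos(w(pos+k)) = -sin(w·pos)·sin(w·k) + cos(w·pos)·cos(w·k)
w = 1.0 / 10000 ** (4 / 512)   # frequency of one dimension pair (2i=4, d_model=512)
k = 7                          # a fixed relative offset

R = np.array([[ np.cos(w * k), np.sin(w * k)],
              [-np.sin(w * k), np.cos(w * k)]])   # depends only on k, not pos

for pos in [0, 3, 100]:
    pair    = np.array([np.sin(w * pos),       np.cos(w * pos)])
    shifted = np.array([np.sin(w * (pos + k)), np.cos(w * (pos + k))])
    assert np.allclose(R @ pair, shifted)   # same R works at every position
```

This is exactly what lets the model attend to "the token 7 positions back" with a single learned linear map, regardless of absolute position.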
Slide 4 of 5
Adding to Embeddings
The positional encoding is simply added, element-wise, to the token embedding: input = embedding(token) + PE(position). Because PE has the same dimension as the embedding, the result keeps that dimension.
✅ Modern LLMs (GPT-4, Llama) use Rotary Position Encoding (RoPE) — an improved version that handles much longer sequences. But the principle is the same!
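In code, the addition is a one-liner. The sketch below (our own helper names, not a library API) shows the key effect: two copies of the same token at different positions end up with different input vectors.

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / 10000 ** (i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)
    return pe

seq_len, d_model = 6, 8
# Pretend every position holds the SAME token (identical embeddings):
token_embeddings = np.zeros((seq_len, d_model))

x = token_embeddings + sinusoidal_pe(seq_len, d_model)  # element-wise add

print(x.shape)                         # (6, 8) — dimension unchanged
print(np.allclose(x[0], x[1]))         # False — positions are now distinguishable
```

Addition (rather than concatenation) keeps the model width fixed; the network learns to disentangle content from position within the same vector.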
🎵
The Musical Chord Analogy
Each position in the sequence is like a unique musical chord — a combination of different frequencies played together. Just as you can identify a chord by its sound, the model can identify a position by its pattern of sine/cosine values. No two positions have exactly the same "chord".
Key Concepts
📍
Positional Encoding
A vector added to each token embedding to encode its position in the sequence.
📈
Sinusoidal
Uses sin/cos at multiple frequencies — each position gets a unique fingerprint.
➕
Addition
PE is added (not concatenated) to the token embedding, keeping the same dimension.
🔄
RoPE
Modern Rotary Position Encoding — used by LLaMA, GPT-NeoX, and other recent models.
Quick Check
Why does a Transformer need Positional Encoding?
A
Attention has no notion of order, so position must be explicitly added to embeddings