Module 3 · Architecture

Positional Encoding

Attention doesn't know word order. Learn how sine and cosine waves inject position information into the model.

⏱️ 8 min 🎯 Lesson 8 of 11
Slide 1 of 5

The Order Problem

Attention is permutation invariant — it treats all tokens equally regardless of position. This means:

Without position info, these look identical to the model:

"Dog bites man"
vs
"Man bites dog"

❌ Very different meanings, but same set of tokens!

💡 We need to inject information about where each token appears in the sequence.
Slide 2 of 5

The Solution: Sinusoidal Encoding

The original Transformer uses sine and cosine functions at different frequencies to encode position:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where:

pos = position in the sequence (0, 1, 2, ...)
i = dimension index
d_model = embedding dimension (e.g., 512)
💡 Even dimensions get sin, odd dimensions get cos. Different frequencies create unique patterns for each position.
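The two formulas above can be sketched directly in Python — a minimal illustration (with a toy d_model of 8), not any particular library's implementation:

```python
import math

def positional_encoding(pos, d_model):
    """Return the PE vector for one position: sin on even dims, cos on odd dims."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        # Frequency shrinks as the dimension index i grows.
        freq = 1.0 / (10000 ** (2 * i / d_model))
        pe[2 * i] = math.sin(pos * freq)      # even dimension 2i:   sin
        pe[2 * i + 1] = math.cos(pos * freq)  # odd dimension 2i+1:  cos
    return pe

print(positional_encoding(0, 8))  # → [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Note that position 0 always yields alternating 0s and 1s, since sin(0) = 0 and cos(0) = 1 at every frequency.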
Slide 3 of 5

Why Sine Waves?

Sine waves have beautiful properties for encoding position:

1
Unique per position: Each position gets a unique pattern of sine/cosine values
2
Bounded: Values always between -1 and +1, same scale as embeddings
3
Relative positions: PE(pos+k) can be expressed as a linear function of PE(pos) — model can learn relative distances
4
Extrapolation: Can be computed for any sequence length, even longer than those seen during training (though in practice quality degrades well beyond the training length)
💡 Think of it like a multi-frequency radio signal. Low frequencies encode rough position (which sentence half), high frequencies encode fine position (which specific word).
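Property 3 is just the angle-addition identities: shifting a position by k rotates each (sin, cos) pair by a fixed angle k·ω. A quick numerical check (toy values for pos, k, and the dimension index):

```python
import math

d_model, pos, k, i = 8, 5, 3, 1          # illustrative values, not special
omega = 1.0 / (10000 ** (2 * i / d_model))

# Rotating the (sin, cos) pair at `pos` by angle k*omega...
s, c = math.sin(pos * omega), math.cos(pos * omega)
s_shift = s * math.cos(k * omega) + c * math.sin(k * omega)
c_shift = c * math.cos(k * omega) - s * math.sin(k * omega)

# ...lands exactly on the pair at `pos + k`.
assert abs(s_shift - math.sin((pos + k) * omega)) < 1e-12
assert abs(c_shift - math.cos((pos + k) * omega)) < 1e-12
```

Because this rotation matrix depends only on the offset k (not on pos), the model can learn "attend 2 tokens back" as a single linear transformation.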
Slide 4 of 5

Adding to Embeddings

The positional encoding is simply added to the token embedding:

input = token_embedding + positional_encoding
1
Token "cat" → embedding vector [0.2, -0.5, 0.8, ...]
2
Position 3 → PE vector [sin(3/1), cos(3/1), sin(3/100), ...]
3
Input = embedding + PE = [0.2+sin(3/1), -0.5+cos(3/1), ...]

The word "cat" at position 3 now has a different input than "cat" at position 7. The attention mechanism can use this difference to understand order!
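The three steps above can be sketched end to end. The embedding values and the 4-dimensional size are made up for illustration:

```python
import math

def pe_vector(pos, d_model):
    # sin on even dims, cos on odd dims, as in the formulas on the previous slide
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        freq = 1.0 / (10000 ** (2 * i / d_model))
        pe[2 * i], pe[2 * i + 1] = math.sin(pos * freq), math.cos(pos * freq)
    return pe

cat_embedding = [0.2, -0.5, 0.8, 0.1]  # hypothetical 4-dim embedding for "cat"

# Same token, two different positions → two different model inputs.
input_at_3 = [e + p for e, p in zip(cat_embedding, pe_vector(3, 4))]
input_at_7 = [e + p for e, p in zip(cat_embedding, pe_vector(7, 4))]
assert input_at_3 != input_at_7
```

Addition (rather than concatenation) keeps the input at the same dimension as the embedding, so the rest of the network is unchanged.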

Slide 5 of 5

Interactive: Positional Encoding Visualization

Each colored line shows a different dimension of the encoding across 50 positions. Notice how different frequencies create distinct patterns:

Low-frequency dimensions (dim 0, 1) change slowly — encoding coarse position. High-frequency dimensions change rapidly — encoding fine position.

✅ Many modern LLMs (e.g., Llama, GPT-NeoX) use Rotary Position Embedding (RoPE) — an improved scheme that rotates query/key vectors instead of adding a PE vector. But the underlying sinusoidal principle is the same!
🎵

The Musical Chord Analogy

Each position in the sequence is like a unique musical chord — a combination of different frequencies played together. Just as you can identify a chord by its sound, the model can identify a position by its pattern of sine/cosine values. No two positions have exactly the same "chord".

Key Concepts

📍
Positional Encoding
A vector added to each token embedding to encode its position in the sequence.
📈
Sinusoidal
Uses sin/cos at multiple frequencies — each position gets a unique fingerprint.
➕
Addition
PE is added (not concatenated) to the token embedding, keeping the same dimension.
🔄
RoPE
Rotary Position Embedding — used by Llama, GPT-NeoX, and other recent models.

Quick Check

Why does a Transformer need Positional Encoding?

A
Attention has no notion of order, so position must be explicitly added to embeddings
B
To increase the embedding dimension
C
To make training faster
D
To encode word frequency in the training data