Module 2 Self-Attention

Self-Attention in Practice

Dive deep into self-attention — where a sequence attends to itself. See the step-by-step calculation with real numbers.

⏱️ 12 min 🎯 Lesson 6 of 11 ⭐ Core Concept
Slide 1 of 6

What is Self-Attention?

In self-attention, the Queries, Keys, and Values all come from the same sequence. Every word attends to every other word in the same sentence — including itself.

💡 "Self" means the sequence is attending to itself. It's not comparing to an external database — it's exploring its own internal relationships.

This allows the model to understand how every word in a sentence relates to every other word, all at once.

Every word creates 3 projections from its embedding:

embedding × W_Q → Query
embedding × W_K → Key
embedding × W_V → Value
Slide 2 of 6

The Weight Matrices W_Q, W_K, W_V

Each word embedding is multiplied by three different learned weight matrices to produce Q, K, and V:

Q
W_Q — Projects embeddings into "what I'm looking for" space
K
W_K — Projects embeddings into "what I have to offer" space
V
W_V — Projects embeddings into "my actual information" space
Weight matrix sizes:
W_Q: [d_model × d_k]
W_K: [d_model × d_k]
W_V: [d_model × d_v]
In the original Transformer paper, d_model = 512 and d_k = d_v = 64.
💡 These matrices W_Q, W_K, W_V are learned during training. The model learns what types of relationships are useful to pay attention to.
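The projection step can be sketched in a few lines of NumPy. The dimensions and random matrices below are illustrative stand-ins (much smaller than the paper's d_model = 512), not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, d_v = 8, 4, 4   # toy sizes for illustration

# In a real model these three matrices are learned during training.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

X = rng.normal(size=(3, d_model))  # embeddings for a 3-token sequence

Q = X @ W_Q   # "what I'm looking for"
K = X @ W_K   # "what I have to offer"
V = X @ W_V   # "my actual information"

print(Q.shape, K.shape, V.shape)  # (3, 4) (3, 4) (3, 4)
```

Note that all three projections start from the same X: that is what makes it *self*-attention.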
Slide 3 of 6

Step by Step: Computing Self-Attention

Let's trace through a simple 3-word sentence: "I love AI"

1
Each word gets an embedding: I=[1,0], love=[0,1], AI=[1,1] (simplified)
2
Multiply each by W_Q, W_K, W_V → get Q, K, V vectors for each word
3
For each Query, compute dot products with all Keys → raw scores
4
Divide by √d_k (e.g., √64=8), apply softmax → attention weights
5
Multiply weights by Values and sum → new enriched representation

The output for each word is a weighted mix of every word's Value vector, including its own: a representation enriched with context!
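The five steps above can be run end to end with the toy 2-d embeddings from step 1. Identity weight matrices are used here purely for readability; real W_Q, W_K, W_V are learned:

```python
import numpy as np

# Step 1: toy embeddings for "I love AI"
X = np.array([[1., 0.],   # I
              [0., 1.],   # love
              [1., 1.]])  # AI

# Step 2: project to Q, K, V (identity matrices as a readable stand-in)
W_Q = W_K = W_V = np.eye(2)
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Steps 3-4: scaled dot products, then softmax over each row
d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)

# Step 5: weighted mix of Values -> context-enriched representations
output = weights @ V

print(weights.round(2))  # each row sums to 1
print(output.round(2))
```

Each row of `weights` is one word's attention distribution over the whole sentence, and each row of `output` is that word's new, context-aware vector.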

Slide 4 of 6

Visualizing What Gets Learned

After training, attention patterns reveal meaningful linguistic structure. Here's what different words attend to in "The animal was tired":

Attention patterns (simplified):

The → attends mostly to → animal (its noun)
animal → attends mostly to → tired (its adjective)
tired → attends mostly to → animal (what's tired)
✅ The model learns to connect semantically related words without being explicitly programmed to do so!
Slide 5 of 6

Self-Attention: Interactive Visualization

Consider how "tired" distributes its attention across the sentence "The animal was tired":

🔍 "tired" pays the most attention (65%) to "animal" — because semantically, "animal" is the subject of being tired. This is how the model resolves meaning!
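A distribution like this comes straight out of the softmax step. The raw scores below are hypothetical numbers, chosen only so that "animal" receives most of "tired"'s attention as in the example above:

```python
import numpy as np

tokens = ["The", "animal", "was", "tired"]
# Hypothetical scaled dot-product scores for the query "tired"
scores = np.array([0.2, 2.3, 0.4, 1.0])

weights = np.exp(scores) / np.exp(scores).sum()  # softmax
for tok, w in zip(tokens, weights):
    print(f"{tok:>6}: {w:.0%}")
```

Because softmax exponentiates, a moderately higher score for "animal" turns into a dominant share of the attention weight.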
Slide 6 of 6

Why "Self"-Attention?

The beauty of self-attention is that it's universal — it works for any relationship in a sequence:

1
Coreference: "it" → "the animal" (pronoun resolution)
2
Agreement: "cats" → "are" (subject-verb agreement)
3
Modification: "big" → "dog" (adjective-noun)
4
Semantic: "Paris" → "France" (entity-location)
💡 The model isn't programmed with grammar rules — it discovers these relationships by itself during training, purely through learning the best way to predict the next word!
🌐

The Meeting Room Analogy

Imagine every word in a sentence is a person in a meeting room. In self-attention, every person can send a message directly to every other person. Each person reads all messages and decides how much to listen to each one. The output is everyone's updated understanding based on the whole group's input.

Key Concepts

🔄
Self-Attention
Attention where Q, K, V all come from the same sequence — the sequence attends to itself.
🧮
W_Q, W_K, W_V
Learned weight matrices that project embeddings into Q, K, V spaces.
📊
Attention Matrix
An N×N matrix showing how much each token attends to each other token.
🌊
Contextual Embeddings
After self-attention, each token's representation includes information from all other tokens.

Quick Check

In self-attention, where do the Query, Key, and Value vectors come from?

A
Q from input, K and V from a separate encoder
B
All three (Q, K, V) come from the same input sequence via different weight matrices
C
They are randomly initialized and not learned
D
Q from positional encoding, K and V from embeddings