Slide 1 of 6
What is Self-Attention?
In self-attention, the Queries, Keys, and Values all come from the same sequence. Every word attends to every other word in the same sentence — including itself.
💡 "Self" means the sequence is attending to itself. It's not comparing to an external database — it's exploring its own internal relationships.
This allows the model to understand how every word in a sentence relates to every other word, all at once.
Every word creates 3 projections from its embedding:
Slide 2 of 6
The Weight Matrices W_Q, W_K, W_V
Each word embedding is multiplied by three different learned weight matrices to produce Q, K, and V:
Q: W_Q — projects embeddings into "what I'm looking for" space
K: W_K — projects embeddings into "what I have to offer" space
V: W_V — projects embeddings into "my actual information" space
Weight matrix sizes (original paper: d_model = 512, d_k = d_v = 64):
W_Q: [d_model × d_k]
W_K: [d_model × d_k]
W_V: [d_model × d_v]
💡 These matrices W_Q, W_K, W_V are learned during training. The model learns what types of relationships are useful to pay attention to.
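The projections above can be sketched in a few lines of NumPy. This is a minimal sketch using the shapes from the original paper; the random initialization stands in for learned weights, which in a real model come from training.

```python
import numpy as np

d_model, d_k, d_v = 512, 64, 64  # sizes from the original paper

rng = np.random.default_rng(0)
W_Q = rng.normal(size=(d_model, d_k))  # "what I'm looking for"
W_K = rng.normal(size=(d_model, d_k))  # "what I have to offer"
W_V = rng.normal(size=(d_model, d_v))  # "my actual information"

# One embedding per word in the sequence: shape [seq_len, d_model]
X = rng.normal(size=(3, d_model))

# Each word's embedding is multiplied by all three weight matrices
Q = X @ W_Q  # shape [3, 64]
K = X @ W_K  # shape [3, 64]
V = X @ W_V  # shape [3, 64]
```

Note that all three projections come from the same input X: that is the "self" in self-attention.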
Slide 3 of 6
Step by Step: Computing Self-Attention
Let's trace through a simple 3-word sentence: "I love AI"
1. Each word gets an embedding: I=[1,0], love=[0,1], AI=[1,1] (simplified)
2. Multiply each by W_Q, W_K, W_V → get Q, K, V vectors for each word
3. For each Query, compute dot products with all Keys → raw scores
4. Divide by √d_k (e.g., √64 = 8), apply softmax → attention weights
5. Multiply the weights by the Values and sum → new enriched representation
The output for each word is a weighted mix of every word's Values (including its own) — enriched with context!
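The five steps above can be traced end to end with the toy embeddings from step 1. As a simplifying assumption, the weight matrices here are identity matrices so the numbers stay easy to follow; in a real model they are learned.

```python
import numpy as np

# Step 1: toy embeddings from the slide (d_model = d_k = 2 here, not 512/64)
X = np.array([[1., 0.],   # "I"
              [0., 1.],   # "love"
              [1., 1.]])  # "AI"

# Step 2: hypothetical weight matrices (identity, for readability)
W_Q = W_K = W_V = np.eye(2)
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Step 3: dot product of each Query with all Keys -> raw scores
scores = Q @ K.T                        # shape [3, 3]

# Step 4: scale by sqrt(d_k), then softmax each row
scaled = scores / np.sqrt(K.shape[-1])
weights = np.exp(scaled) / np.exp(scaled).sum(axis=-1, keepdims=True)

# Step 5: weighted sum of Values -> context-enriched representations
output = weights @ V                    # shape [3, 2]

assert np.allclose(weights.sum(axis=-1), 1.0)  # each row is a distribution
```

Row i of `weights` tells you how much word i attends to every word in the sentence, and row i of `output` is the resulting contextual representation.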
Slide 4 of 6
Visualizing What Gets Learned
After training, attention patterns reveal meaningful linguistic structure. Here's what different words attend to in "The animal was tired":
Attention patterns (simplified):
"The" → attends mostly to → "animal" (its noun)
"animal" → attends mostly to → "tired" (its adjective)
"tired" → attends mostly to → "animal" (what's tired)
✅ The model learns to connect semantically related words without being explicitly programmed to do so!
Slide 5 of 6
Self-Attention: Interactive Visualization
The bar chart below shows how "tired" distributes its attention across the sentence "The animal was tired":
🔍 "tired" pays the most attention (65%) to "animal" — because semantically, "animal" is the subject of being tired. This is how the model resolves meaning!
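A distribution like the one in the chart falls out of the softmax step. The raw scores below are hypothetical, chosen only so the resulting weights roughly match the 65% figure on the slide:

```python
import numpy as np

words = ["The", "animal", "was", "tired"]
# Hypothetical raw scores for the Query of "tired" against each word's Key
scores = np.array([0.1, 2.1, 0.0, 0.8])

# Softmax turns raw scores into attention weights that sum to 1
weights = np.exp(scores) / np.exp(scores).sum()

for word, w in zip(words, weights):
    print(f"{word:>6}: {w:.0%}")
# "animal" receives ~65% of "tired"'s attention
```

Because softmax exponentiates, even a modest gap in raw scores concentrates most of the attention mass on the top-scoring word.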
Slide 6 of 6
Why "Self"-Attention?
The beauty of self-attention is that it's universal — it works for any relationship in a sequence:
1. Coreference: "it" → "the animal" (pronoun resolution)
2. Agreement: "cats" → "are" (subject-verb agreement)
3. Modification: "big" → "dog" (adjective-noun)
4. Semantic: "Paris" → "France" (entity-location)
💡 The model isn't programmed with grammar rules — it discovers these relationships by itself during training, purely through learning the best way to predict the next word!