Slide 1 of 7
What is Attention?
Attention is a mechanism that lets a model selectively focus on different parts of the input when processing each element.
📖 Human Reading Analogy
When you read "The bank was steep and muddy from the rain", you look back at context to determine whether "bank" means a financial institution or a river bank. You attend to "steep", "muddy", and "rain" to understand "bank".
Transformers do the same thing mathematically — every token computes a weighted blend of all other tokens based on relevance.
Slide 2 of 7
The Library Analogy
Attention can be understood as a fuzzy database lookup:
Q — Query: what you're looking for. "I want books about machine learning"
K — Keys: index cards describing each item. "Neural Networks", "Cooking", "History"...
V — Values: the actual content of each item. The books themselves.
Query ⊗ Keys → Scores → Softmax → Weighted Values
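The fuzzy lookup can be sketched in a few lines of NumPy. The keys, similarity scores, and "book contents" below are all invented for illustration:

```python
import numpy as np

# Hard database lookup: one key matches exactly, you get one value back.
books = {"Neural Networks": "NN textbook",
         "Cooking": "cookbook",
         "History": "history book"}
print(books["Cooking"])  # exact-match retrieval

# Fuzzy lookup: score the query against EVERY key, softmax the scores,
# and return a blend of ALL values. Similarity scores are invented here.
similarity = np.array([0.9, 0.1, 0.2])  # query vs. each key above
weights = np.exp(similarity) / np.exp(similarity).sum()  # softmax
contents = np.array([10.0, 20.0, 30.0])  # toy stand-ins for the books
blended = weights @ contents             # weighted mix of all values
print(weights)  # largest weight on "Neural Networks"
```

Unlike the dictionary lookup, the fuzzy version never fails with a missing key: every value contributes, weighted by how well its key matches the query.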
Slide 3 of 7
Step 1: Compute Attention Scores
For each query token, we compute how similar it is to every key (including itself). Similarity = dot product.
score(Q, K) = Q · Kᵀ
Example — Computing scores for the word "tired":
"tired" vs "The":0.1 (low relevance)
"tired" vs "animal":0.9 (HIGH — animal is what's tired!)
"tired" vs "was":0.2
"tired" vs "tired":0.7 (self-attention)
💡 The dot product is high when two vectors point in similar directions — meaning the Query and Key are semantically similar.
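The scores above can be reproduced in miniature with NumPy. The 4-dimensional embeddings below are invented for illustration; real models learn these vectors during training:

```python
import numpy as np

# Toy 4-dimensional embeddings (made up for this example).
q_tired = np.array([0.9, 0.1, 0.3, 0.5])  # query vector for "tired"
keys = {
    "The":    np.array([0.1, 0.0, 0.2, 0.1]),
    "animal": np.array([0.8, 0.2, 0.4, 0.6]),
    "was":    np.array([0.1, 0.3, 0.1, 0.2]),
    "tired":  np.array([0.7, 0.0, 0.2, 0.4]),
}

# score(Q, K) = Q · Kᵀ — one dot product per key
scores = {word: float(q_tired @ k) for word, k in keys.items()}
for word, s in scores.items():
    print(f'"tired" vs "{word}": {s:.2f}')
```

With these toy vectors, "animal" gets the highest score because its key points in roughly the same direction as the query for "tired".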
Slide 4 of 7
Step 2: Scale and Softmax
Raw dot products grow with the vector dimension and can get very large, pushing softmax into a saturated regime where gradients become vanishingly small. So we scale them first:
scaled_score = Q · Kᵀ / √d_k
Where d_k is the dimension of the key vectors (e.g., 64). This keeps values in a stable range.
Then we apply Softmax to convert scores into probabilities that sum to 1:
attention_weights = softmax(scaled_score)
After softmax, the attention weights for "tired" over all tokens:
The: 5%
animal: 65%
was: 10%
tired: 20%
✅ All weights sum to 100%
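A minimal sketch of scale-then-softmax. The raw scores are invented so that the resulting weights land near the example's percentages:

```python
import numpy as np

d_k = 64  # key dimension, as in the slide
# Raw scores for "tired" vs (The, animal, was, tired) — invented values,
# large enough that dividing by sqrt(d_k) visibly tames them.
raw_scores = np.array([2.0, 22.0, 7.0, 13.0])

scaled = raw_scores / np.sqrt(d_k)  # sqrt(64) = 8

# Softmax: exponentiate, then normalize so the weights sum to 1.
# Subtracting the max first is the standard numerical-stability trick.
exp = np.exp(scaled - scaled.max())
weights = exp / exp.sum()

print(weights)        # roughly [0.05, 0.64, 0.10, 0.21]
print(weights.sum())  # 1.0
```

Without the 1/√d_k scaling, the same raw scores would put essentially all of the weight on "animal" and nearly zero everywhere else, which is exactly the saturation problem the scaling prevents.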
Slide 5 of 7
Step 3: Weighted Sum of Values
Finally, multiply each Value vector by its attention weight and sum them all up:
output = Σ (attention_weight × Value)
For "tired" computing its output vector:
output = 0.65 × V("animal")
+ 0.20 × V("tired")
+ 0.10 × V("was")
+ 0.05 × V("The")
The output for "tired" is mostly influenced by "animal" (65%) — so the model now understands that "tired" relates to the animal!
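The same sum in code, using the weights from the slide and made-up 4-dimensional Value vectors:

```python
import numpy as np

# Attention weights for "tired" from the example above.
weights = {"The": 0.05, "animal": 0.65, "was": 0.10, "tired": 0.20}

# Toy Value vectors (invented for illustration).
values = {
    "The":    np.array([0.1, 0.0, 0.0, 0.1]),
    "animal": np.array([0.9, 0.3, 0.7, 0.2]),
    "was":    np.array([0.2, 0.1, 0.0, 0.3]),
    "tired":  np.array([0.4, 0.8, 0.1, 0.5]),
}

# output = Σ (attention_weight × Value)
output = sum(w * values[word] for word, w in weights.items())
print(output)  # dominated by V("animal"), which carries 65% of the weight
```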
Slide 6 of 7
The Complete Formula
Putting it all together in one elegant equation:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
1. QKᵀ — Dot product of all queries and keys (similarity scores)
2. / √d_k — Scale to prevent exploding values
3. softmax() — Convert scores to probabilities (sum to 1)
4. · V — Weight and sum the value vectors
💡 This simple formula is computed in parallel for ALL tokens simultaneously using matrix multiplication — incredibly efficient on GPUs!
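The whole formula fits in a short NumPy function. This is a plain single-head sketch (no masking, no batching, no learned projections):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QKᵀ / √d_k) · V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # steps 1 + 2: similarity, scaled
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True)  # step 3: row-wise softmax
    return weights @ V                    # step 4: weighted sum of values

# One matrix call handles all tokens at once: 4 tokens, d_k = 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = attention(Q, K, V)
print(out.shape)  # (4, 8) — one output vector per token
```

Note that every token's output is produced by the same two matrix multiplications; this is what makes the computation parallelize so well on GPUs.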
Slide 7 of 7
Attention is Differentiable
A key property: attention weights are computed by a smooth, differentiable function. This means the model can learn which things to attend to through gradient descent.
What gets learned:
✓ Which words are relevant to which
✓ Syntactic relationships (subject-verb)
✓ Semantic connections (pronoun-noun)
Magic of gradients
The model doesn't need to be told what to attend to — it figures it out during training from millions of examples!
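That smoothness can be checked numerically with a finite-difference probe. This is only an illustration; real training computes exact gradients via automatic differentiation, not finite differences:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Scaled scores for (The, animal, was, tired) from the earlier example.
scores = np.array([0.25, 2.75, 0.875, 1.625])

# Nudge one score slightly and see how the attention weights respond.
eps = 1e-6
w0 = softmax(scores)
nudged = scores.copy()
nudged[1] += eps           # tiny bump to the "animal" score
w1 = softmax(nudged)

grad_est = (w1 - w0) / eps  # finite-difference sensitivity
print(grad_est)  # finite values, no jumps — the mapping is smooth
```

Because a small change in any score produces a proportionally small, predictable change in the weights, gradient descent can nudge the scores toward whatever attention pattern reduces the loss.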