Module 2 Attention Mechanism

The Attention Mechanism

The core innovation of Transformers. Learn how "attention" lets every word focus on what matters most.

⏱️ 12 min 🎯 Lesson 5 of 11 ⭐ Core Concept
Slide 1 of 7

What is Attention?

Attention is a mechanism that lets a model selectively focus on different parts of the input when processing each element.

📖

Human Reading Analogy

When you read "The bank was steep and muddy from the rain", you look back at context to determine whether "bank" means a financial institution or a river bank. You attend to "steep", "muddy", and "rain" to understand "bank".

Transformers do the same thing mathematically — every token computes a weighted blend of all other tokens based on relevance.

Slide 2 of 7

The Library Analogy

Attention can be understood as a fuzzy database lookup:

Q
Query — What you're looking for. "I want books about machine learning"
K
Keys — Index cards describing each item. "Neural Networks", "Cooking", "History"...
V
Values — The actual content of each item. The books themselves.
Query
Keys
Scores
Softmax
Weighted Values
Slide 3 of 7

Step 1: Compute Attention Scores

For each query token, we compute how similar it is to every key (including itself). Similarity = dot product.

score(Q, K) = Q · Kᵀ

Example — Computing scores for the word "tired":

"tired" vs "The": 0.1 (low relevance)
"tired" vs "animal": 0.9 (HIGH — the animal is what's tired!)
"tired" vs "was": 0.2
"tired" vs "tired": 0.7 (self-attention)
💡 The dot product is high when two vectors point in similar directions — meaning the Query and Key are semantically similar.
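The scoring step above can be sketched in a few lines of NumPy. The 4-dimensional embeddings below are made-up values chosen so that "animal" scores highest; in a real model, queries and keys come from learned projection matrices applied to token embeddings.

```python
import numpy as np

# Toy 4-dimensional vectors (illustrative values, not from a real model).
q_tired = np.array([1.0, 0.5, 0.2, 0.1])  # query for "tired"
keys = {
    "The":    np.array([0.1, 0.0, 0.3, 0.2]),
    "animal": np.array([0.9, 0.4, 0.1, 0.0]),
    "was":    np.array([0.2, 0.1, 0.0, 0.3]),
    "tired":  np.array([0.6, 0.3, 0.2, 0.1]),
}

# Similarity = dot product of the query with each key.
scores = {word: float(q_tired @ k) for word, k in keys.items()}
# "animal" points in the most similar direction, so it gets the top score.
```

Because the query and the "animal" key point in similar directions, their dot product dominates — exactly the behavior the example scores illustrate.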
Slide 4 of 7

Step 2: Scale and Softmax

Raw dot products grow with the vector dimension and can get very large, pushing softmax into regions where gradients vanish. So we scale them first:

scaled_score = Q · Kᵀ / √d_k

Where d_k is the dimension of the key vectors (e.g., 64). This keeps values in a stable range.

Then we apply Softmax to convert scores into probabilities that sum to 1:

attention_weights = softmax(scaled_score)

After softmax, "tired" → all tokens:

The: 5% · animal: 65% · was: 10% · tired: 20%

✅ All weights sum to 100%
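Here is a minimal sketch of the scale-and-softmax step, assuming d_k = 64 and the illustrative raw scores from the previous slide. The exact percentages depend on the scores, so they won't match the slide's numbers exactly.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

d_k = 64                                      # key dimension (assumed)
raw_scores = np.array([0.1, 0.9, 0.2, 0.7])   # The, animal, was, tired

scaled = raw_scores / np.sqrt(d_k)  # keep values in a stable range
weights = softmax(scaled)           # probabilities that sum to 1
```

Whatever the raw scores are, the output is always a proper probability distribution: non-negative weights summing to 1.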

Slide 5 of 7

Step 3: Weighted Sum of Values

Finally, multiply each Value vector by its attention weight and sum them all up:

output = Σ (attention_weight × Value)

For "tired" computing its output vector:

output = 0.65 × V("animal")
+ 0.20 × V("tired")
+ 0.10 × V("was")
+ 0.05 × V("The")

The output for "tired" is mostly influenced by "animal" (65%) — so the model now understands that "tired" relates to the animal!
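The weighted sum can be written as a single matrix-vector product. The value vectors below are made up for illustration; only the attention weights come from the example above.

```python
import numpy as np

# Attention weights for "tired" (The, animal, was, tired).
weights = np.array([0.05, 0.65, 0.10, 0.20])

# Toy 3-dimensional value vectors, one row per token (illustrative).
V = np.array([
    [0.1, 0.2, 0.0],   # V("The")
    [0.9, 0.1, 0.4],   # V("animal")
    [0.2, 0.0, 0.1],   # V("was")
    [0.3, 0.5, 0.2],   # V("tired")
])

# output = Σ (attention_weight × Value), i.e. weights @ V
output = weights @ V
```

Because "animal" carries 65% of the weight, the output vector lands closest to V("animal") — the blended representation of "tired" is dominated by the animal.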

Slide 6 of 7

The Complete Formula

Putting it all together in one elegant equation:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
1
QKᵀ — Dot product of all queries and keys (similarity scores)
2
/ √d_k — Scale to prevent exploding values
3
softmax() — Convert scores to probabilities (sum to 1)
4
· V — Weight and sum the value vectors
💡 This one formula is computed in parallel for ALL tokens simultaneously using matrix multiplication — incredibly efficient on GPUs!
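The complete formula translates almost line for line into NumPy. This sketch batches all queries at once, so the random toy matrices stand in for real learned projections.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QKᵀ / √d_k) · V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # 1-2: similarity, scaled
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)     # 3: row-wise softmax
    return weights @ V, weights                     # 4: weight and sum values

# Toy inputs: 4 tokens, d_k = 8 (random stand-ins for learned projections).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

out, w = attention(Q, K, V)   # out: one blended vector per token
```

Note that every token's output is computed in the same two matrix multiplications — no loop over tokens is needed, which is why this maps so well to GPUs.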
Slide 7 of 7

Attention is Differentiable

A key property: attention weights are computed by a smooth, differentiable function. This means the model can learn which things to attend to through gradient descent.

What gets learned:

Which words are relevant to which
Syntactic relationships (subject-verb)
Semantic connections (pronoun-noun)
Long-range dependencies
Magic of gradients

The model doesn't need to be told what to attend to — it figures it out during training from millions of examples!
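Differentiability can be checked numerically: nudging one score by a tiny amount changes every weight by a tiny, well-defined amount. This finite-difference sketch (with illustrative scores) approximates the gradient of the softmax weights with respect to one score.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([0.1, 0.9, 0.2, 0.7])  # illustrative scaled scores
eps = 1e-6

base = softmax(scores)
bumped = softmax(scores + np.array([0.0, eps, 0.0, 0.0]))

# Finite-difference approximation of d(weights)/d(score_1).
grad = (bumped - base) / eps
# Raising one score smoothly raises its own weight and lowers the others.
```

Because the weights always sum to 1, the gradient components sum to zero: any weight one token gains, the others must lose. That smooth trade-off is what gradient descent exploits during training.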

🎯 Interactive: Attention Weights
Click a word to see how much it "attends to" each other word in the sentence.

Brighter cells = higher attention weight. Each row = one query word attending to all other words.

Key Concepts

Query (Q)
What the current token is "looking for" in other tokens.
🔑
Key (K)
What each token "offers" to be found — describes its content.
💎
Value (V)
The actual information of each token, passed along when attended to.
🌡️
Softmax
Converts raw scores into probabilities that sum to 1.0.

Quick Check

What do attention weights represent in the formula Attention(Q,K,V) = softmax(QKᵀ/√d_k)·V?

A
The size of the vocabulary
B
The position of each token
C
The learning rate
D
How much each token contributes to the output (probabilities summing to 1)