Module 2 Attention Mechanism

The Attention Mechanism

The core innovation of Transformers. Learn how "attention" lets every word focus on what matters most.

⏱️ 12 min 🎯 Lesson 5 of 11 ⭐ Core Concept
Slide 1 of 7

What is Attention?

Attention is a mechanism that lets a model selectively focus on different parts of the input when processing each element.

📖

Human Reading Analogy

When you read "The bank was steep and muddy from the rain", you look back at context to determine whether "bank" means a financial institution or a river bank. You attend to "steep", "muddy", and "rain" to understand "bank".

Transformers do the same thing mathematically — every token computes a weighted blend of all other tokens based on relevance.

Slide 2 of 7

The Library Analogy

Attention can be understood as a fuzzy database lookup:

Q
Query — What you're looking for. "I want books about machine learning"
K
Keys — Index cards describing each item. "Neural Networks", "Cooking", "History"...
V
Values — The actual content of each item. The books themselves.
Query
Keys
Scores
Softmax
Weighted Values
Slide 3 of 7

Step 1: Compute Attention Scores

For each query token, we compute how similar it is to every key (including itself). Similarity = dot product.

score(Q, K) = Q · Kᵀ

Example — Computing scores for the word "tired":

"tired" vs "The": 0.1 (low relevance)
"tired" vs "animal": 0.9 (HIGH — the animal is what's tired!)
"tired" vs "was": 0.2
"tired" vs "tired": 0.7 (self-attention)
💡 The dot product is high when two vectors point in similar directions — meaning the Query and Key are semantically similar.
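The scoring step above can be sketched in a few lines of NumPy. The 4-dimensional embeddings below are made-up values chosen so that "animal" scores highest; in a real model, queries and keys come from learned projection matrices applied to token embeddings.

```python
import numpy as np

# Toy 4-dimensional vectors (illustrative values, not from a real model).
q_tired = np.array([1.0, 0.5, 0.2, 0.1])  # query for "tired"
keys = {
    "The":    np.array([0.1, 0.0, 0.3, 0.2]),
    "animal": np.array([0.9, 0.4, 0.1, 0.0]),
    "was":    np.array([0.2, 0.1, 0.0, 0.3]),
    "tired":  np.array([0.6, 0.3, 0.2, 0.1]),
}

# Similarity = dot product of the query with each key.
scores = {word: float(q_tired @ k) for word, k in keys.items()}
# "animal" points in the most similar direction, so it gets the top score.
```

Because the query and the "animal" key point in similar directions, their dot product dominates — exactly the behavior the example scores illustrate.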
Slide 4 of 7

Step 2: Scale and Softmax

Raw dot products grow with the vector dimension and can get very large, pushing softmax into regions where gradients vanish. So we scale them first:

scaled_score = Q · Kᵀ / √d_k

Where d_k is the dimension of the key vectors (e.g., 64). This keeps values in a stable range.

Then we apply Softmax to convert scores into probabilities that sum to 1:

attention_weights = softmax(scaled_score)

After softmax, "tired" → all tokens:

The: 5% · animal: 65% · was: 10% · tired: 20%

✅ All weights sum to 100%
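Here is a minimal sketch of the scale-and-softmax step, assuming d_k = 64 and the illustrative raw scores from the previous slide. The exact percentages depend on the scores, so they won't match the slide's numbers exactly.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

d_k = 64                                      # key dimension (assumed)
raw_scores = np.array([0.1, 0.9, 0.2, 0.7])   # The, animal, was, tired

scaled = raw_scores / np.sqrt(d_k)  # keep values in a stable range
weights = softmax(scaled)           # probabilities that sum to 1
```

Whatever the raw scores are, the output is always a proper probability distribution: non-negative weights summing to 1.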

Slide 5 of 7

Step 3: Weighted Sum of Values

Finally, multiply each Value vector by its attention weight and sum them all up:

output = Σ (attention_weight × Value)

For "tired" computing its output vector:

output = 0.65 × V("animal")
+ 0.20 × V("tired")
+ 0.10 × V("was")
+ 0.05 × V("The")

The output for "tired" is mostly influenced by "animal" (65%) — so the model now understands that "tired" relates to the animal!
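The weighted sum can be written as a single matrix-vector product. The value vectors below are made up for illustration; only the attention weights come from the example above.

```python
import numpy as np

# Attention weights for "tired" (The, animal, was, tired).
weights = np.array([0.05, 0.65, 0.10, 0.20])

# Toy 3-dimensional value vectors, one row per token (illustrative).
V = np.array([
    [0.1, 0.2, 0.0],   # V("The")
    [0.9, 0.1, 0.4],   # V("animal")
    [0.2, 0.0, 0.1],   # V("was")
    [0.3, 0.5, 0.2],   # V("tired")
])

# output = Σ (attention_weight × Value), i.e. weights @ V
output = weights @ V
```

Because "animal" carries 65% of the weight, the output vector lands closest to V("animal") — the blended representation of "tired" is dominated by the animal.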

Slide 6 of 7

The Complete Formula

Putting it all together in one elegant equation:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
1
QKᵀ — Dot product of all queries and keys (similarity scores)
2
/ √d_k — Scale to prevent exploding values
3
softmax() — Convert scores to probabilities (sum to 1)
4
· V — Weight and sum the value vectors
💡 This one formula is computed in parallel for ALL tokens simultaneously using matrix multiplication — incredibly efficient on GPUs!
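The complete formula translates almost line for line into NumPy. This sketch batches all queries at once, so the random toy matrices stand in for real learned projections.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QKᵀ / √d_k) · V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # 1-2: similarity, scaled
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)     # 3: row-wise softmax
    return weights @ V, weights                     # 4: weight and sum values

# Toy inputs: 4 tokens, d_k = 8 (random stand-ins for learned projections).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

out, w = attention(Q, K, V)   # out: one blended vector per token
```

Note that every token's output is computed in the same two matrix multiplications — no loop over tokens is needed, which is why this maps so well to GPUs.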
Slide 7 of 7

Attention is Differentiable

A key property: attention weights are computed by a smooth, differentiable function. This means the model can learn which things to attend to through gradient descent.

What gets learned:

Which words are relevant to which
Syntactic relationships (subject-verb)
Semantic connections (pronoun-noun)
Long-range dependencies
Magic of gradients

The model doesn't need to be told what to attend to — it figures it out during training from millions of examples!
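Differentiability can be checked numerically: nudging one score by a tiny amount changes every weight by a tiny, well-defined amount. This finite-difference sketch (with illustrative scores) approximates the gradient of the softmax weights with respect to one score.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([0.1, 0.9, 0.2, 0.7])  # illustrative scaled scores
eps = 1e-6

base = softmax(scores)
bumped = softmax(scores + np.array([0.0, eps, 0.0, 0.0]))

# Finite-difference approximation of d(weights)/d(score_1).
grad = (bumped - base) / eps
# Raising one score smoothly raises its own weight and lowers the others.
```

Because the weights always sum to 1, the gradient components sum to zero: any weight one token gains, the others must lose. That smooth trade-off is what gradient descent exploits during training.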

🎯 Interactive: Attention Weights
Click a word to see how much it "attends to" each other word in the sentence.

Brighter cells = higher attention weight. Each row = one query word attending to all other words.

Key Concepts

Query (Q)
What the current token is "looking for" in other tokens.
🔑
Key (K)
What each token "offers" to be found — describes its content.
💎
Value (V)
The actual information of each token, passed along when attended to.
🌡️
Softmax
Converts raw scores into probabilities that sum to 1.0.

Quick Check

What do attention weights represent in the formula Attention(Q,K,V) = softmax(QKᵀ/√d_k)·V?

A
The size of the vocabulary
B
The position of each token
C
The learning rate
D
How much each token contributes to the output (probabilities summing to 1)