Module 2 Self-Attention

Self-Attention in Practice

Dive deep into self-attention — where a sequence attends to itself. See the step-by-step calculation with real numbers.

⏱️ 12 min 🎯 Lesson 6 of 11 ⭐ Core Concept
Slide 1 of 6

What is Self-Attention?

In self-attention, the Queries, Keys, and Values all come from the same sequence. Every word attends to every other word in the same sentence — including itself.

💡 "Self" means the sequence is attending to itself. It's not comparing to an external database — it's exploring its own internal relationships.

This allows the model to understand how every word in a sentence relates to every other word, all at once.

Every word creates 3 projections from its embedding:

embedding × W_Q → Query
embedding × W_K → Key
embedding × W_V → Value
Slide 2 of 6

The Weight Matrices W_Q, W_K, W_V

Each word embedding is multiplied by three different learned weight matrices to produce Q, K, and V:

Q
W_Q — Projects embeddings into "what I'm looking for" space
K
W_K — Projects embeddings into "what I have to offer" space
V
W_V — Projects embeddings into "my actual information" space
Weight matrix sizes:
W_Q: [d_model × d_k]
W_K: [d_model × d_k]
W_V: [d_model × d_v]
In the original Transformer paper, d_model = 512 and d_k = d_v = 64.
💡 These matrices W_Q, W_K, W_V are learned during training. The model learns what types of relationships are useful to pay attention to.
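The projection step can be sketched in a few lines of NumPy. The dimensions and random matrices below are illustrative stand-ins (much smaller than the paper's d_model = 512), not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, d_v = 8, 4, 4   # toy sizes for illustration

# In a real model these three matrices are learned during training.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

X = rng.normal(size=(3, d_model))  # embeddings for a 3-token sequence

Q = X @ W_Q   # "what I'm looking for"
K = X @ W_K   # "what I have to offer"
V = X @ W_V   # "my actual information"

print(Q.shape, K.shape, V.shape)  # (3, 4) (3, 4) (3, 4)
```

Note that all three projections start from the same X: that is what makes it *self*-attention.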
Slide 3 of 6

Step by Step: Computing Self-Attention

Let's trace through a simple 3-word sentence: "I love AI"

1
Each word gets an embedding: I=[1,0], love=[0,1], AI=[1,1] (simplified)
2
Multiply each by W_Q, W_K, W_V → get Q, K, V vectors for each word
3
For each Query, compute dot products with all Keys → raw scores
4
Divide by √d_k (e.g., √64=8), apply softmax → attention weights
5
Multiply weights by Values and sum → new enriched representation

The output for each word is a weighted mix of every word's Value vector, including its own: a representation enriched with context!
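The five steps above can be run end to end with the toy 2-d embeddings from step 1. Identity weight matrices are used here purely for readability; real W_Q, W_K, W_V are learned:

```python
import numpy as np

# Step 1: toy embeddings for "I love AI"
X = np.array([[1., 0.],   # I
              [0., 1.],   # love
              [1., 1.]])  # AI

# Step 2: project to Q, K, V (identity matrices as a readable stand-in)
W_Q = W_K = W_V = np.eye(2)
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Steps 3-4: scaled dot products, then softmax over each row
d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)

# Step 5: weighted mix of Values -> context-enriched representations
output = weights @ V

print(weights.round(2))  # each row sums to 1
print(output.round(2))
```

Each row of `weights` is one word's attention distribution over the whole sentence, and each row of `output` is that word's new, context-aware vector.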

Slide 4 of 6

Visualizing What Gets Learned

After training, attention patterns reveal meaningful linguistic structure. Here's what different words attend to in "The animal was tired":

Attention patterns (simplified):

The → attends mostly to → animal (its noun)
animal → attends mostly to → tired (its adjective)
tired → attends mostly to → animal (what's tired)
✅ The model learns to connect semantically related words without being explicitly programmed to do so!
Slide 5 of 6

Self-Attention: Interactive Visualization

Consider how "tired" distributes its attention across the sentence "The animal was tired":

🔍 "tired" pays the most attention (65%) to "animal" — because semantically, "animal" is the subject of being tired. This is how the model resolves meaning!
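A distribution like this comes straight out of the softmax step. The raw scores below are hypothetical numbers, chosen only so that "animal" receives most of "tired"'s attention as in the example above:

```python
import numpy as np

tokens = ["The", "animal", "was", "tired"]
# Hypothetical scaled dot-product scores for the query "tired"
scores = np.array([0.2, 2.3, 0.4, 1.0])

weights = np.exp(scores) / np.exp(scores).sum()  # softmax
for tok, w in zip(tokens, weights):
    print(f"{tok:>6}: {w:.0%}")
```

Because softmax exponentiates, a moderately higher score for "animal" turns into a dominant share of the attention weight.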
Slide 6 of 6

Why "Self"-Attention?

The beauty of self-attention is that it's universal — it works for any relationship in a sequence:

1
Coreference: "it" → "the animal" (pronoun resolution)
2
Agreement: "cats" → "are" (subject-verb agreement)
3
Modification: "big" → "dog" (adjective-noun)
4
Semantic: "Paris" → "France" (entity-location)
💡 The model isn't programmed with grammar rules — it discovers these relationships by itself during training, purely through learning the best way to predict the next word!
🌐

The Meeting Room Analogy

Imagine every word in a sentence is a person in a meeting room. In self-attention, every person can send a message directly to every other person. Each person reads all messages and decides how much to listen to each one. The output is everyone's updated understanding based on the whole group's input.

Key Concepts

🔄
Self-Attention
Attention where Q, K, V all come from the same sequence — the sequence attends to itself.
🧮
W_Q, W_K, W_V
Learned weight matrices that project embeddings into Q, K, V spaces.
📊
Attention Matrix
An N×N matrix showing how much each token attends to each other token.
🌊
Contextual Embeddings
After self-attention, each token's representation includes information from all other tokens.

Quick Check

In self-attention, where do the Query, Key, and Value vectors come from?

A
Q from input, K and V from a separate encoder
B
All three (Q, K, V) come from the same input sequence via different weight matrices
C
They are randomly initialized and not learned
D
Q from positional encoding, K and V from embeddings