Module 2 · Attention

Multi-Head Attention

Why one attention head isn't enough — and how running multiple attention mechanisms in parallel gives richer understanding.

⏱️ 10 min 🎯 Lesson 7 of 11
Slide 1 of 5

One Head Isn't Enough

A single attention head can only focus on one type of relationship at a time. But language has many simultaneous relationships:

1. Syntactic: Subject → Verb agreement ("The cats are")
2. Semantic: Word meaning in context ("bank" = river or finance?)
3. Positional: Nearby words tend to be related
4. Coreference: Pronouns → their referents
💡 Solution: Run several attention heads in parallel, each learning to capture a different type of relationship!
Slide 2 of 5

Multiple Perspectives

Each head has its own W_Q, W_K, W_V matrices — so each head learns to attend to different things:

Head 1: Subject-Verb relationships ("cats" → "are")
Head 2: Noun-Adjective relationships ("big" → "dog")
Head 3: Coreference ("it" → "the cat")
Head 4: Positional proximity (adjacent words)

Original Transformer: h = 8 heads
GPT-3: h = 96 heads
Slide 3 of 5

The MHA Formula

Multi-Head Attention concatenates the output of all heads, then projects:

MultiHead(Q,K,V) = Concat(head₁, ..., headₕ) · W_O

where each head is:

headᵢ = Attention(Q·W_Qᵢ, K·W_Kᵢ, V·W_Vᵢ)
1. Each head runs attention with its own learned W_Q, W_K, W_V
2. Each head produces an output vector of size d_v (e.g., 64)
3. All h outputs are concatenated → h × d_v dimensions (e.g., 8 × 64 = 512)
4. W_O projects back to d_model dimensions (e.g., 512)
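The four steps above can be sketched in plain NumPy (a minimal, illustrative sketch, not a production implementation; the dimensions match the original Transformer, and all variable names are our own):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """X: (seq_len, d_model); W_Q/W_K/W_V: lists of h matrices (d_model, d_k);
    W_O: (h * d_v, d_model). Returns (seq_len, d_model)."""
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv          # each (seq_len, d_k)
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)           # scaled dot-product
        A = softmax(scores, axis=-1)              # attention weights per row
        heads.append(A @ V)                       # (seq_len, d_v)
    # Step 3: concatenate all heads; step 4: project back with W_O.
    return np.concatenate(heads, axis=-1) @ W_O

# Toy example: d_model = 512, h = 8, d_k = d_v = 64, sequence of 10 tokens.
rng = np.random.default_rng(0)
d_model, h, d_k = 512, 8, 64
X = rng.standard_normal((10, d_model))
scale = 1 / np.sqrt(d_model)                      # keep activations small
W_Q = [rng.standard_normal((d_model, d_k)) * scale for _ in range(h)]
W_K = [rng.standard_normal((d_model, d_k)) * scale for _ in range(h)]
W_V = [rng.standard_normal((d_model, d_k)) * scale for _ in range(h)]
W_O = rng.standard_normal((h * d_k, d_model)) * scale
out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (10, 512)
```

Note that the input and output shapes are identical, which is what lets Transformer blocks stack on top of each other.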
Slide 4 of 5

Interactive: Head Visualization

An interactive visualization shows 4 attention heads running in parallel on the same sentence.

Each head processes the same input but through different learned projections, capturing different patterns.

Slide 5 of 5

Why This is Powerful

Multi-Head Attention gives the model the ability to simultaneously understand:

Grammar
Semantics
Coreference
Proximity
Discourse

...all from the same layer, in one forward pass.

✅ The model isn't told which head should learn what — it discovers the most useful division of labor during training!
Compute efficiency: Each head works on a smaller d_k (d_model/h), so total compute is comparable to one full-size head — but with h heads free to specialize in different patterns!
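This compute parity is easy to verify with a little arithmetic (an illustrative count of the Q/K/V projection parameters only, using the original Transformer's sizes):

```python
# One full-size head vs. h smaller heads: same Q/K/V projection cost.
d_model, h = 512, 8
d_k = d_model // h  # 64 per head

# Single head with d_k = d_model: three (d_model x d_model) projections.
one_big_head = 3 * d_model * d_model
# h heads, each with three (d_model x d_k) projections where d_k = d_model / h.
many_small_heads = h * 3 * d_model * d_k

print(one_big_head, many_small_heads)  # 786432 786432 — identical
```

The h in the per-head count cancels the 1/h shrink in d_k, so splitting into heads adds diversity essentially for free.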
🎭

The Expert Panel Analogy

Imagine a committee of experts reviewing a document. The grammar expert focuses on sentence structure, the content expert on meaning, the fact-checker on references, and the rhetorician on tone. Multi-Head Attention is like this — each head is a different expert examining the text, and their conclusions are combined for a complete picture.

Key Concepts

🔀
Multi-Head Attention
Running h attention heads in parallel, each with different learned projections.
🔢
h (number of heads)
Original Transformer: h=8. Modern models: up to h=96 or more.
📎
Concatenation
Head outputs are concatenated, then projected by W_O back to d_model size.
📐
d_k = d_model / h
Each head works in a lower dimension (d_model/h), keeping total compute roughly constant.

Quick Check

Why does Multi-Head Attention use multiple heads instead of one big attention head?

A. To increase the vocabulary size
B. To process multiple sentences at once
C. Each head can learn to capture different types of relationships simultaneously
D. To reduce memory usage