Module 2 · Attention

Multi-Head Attention

Why one attention head isn't enough — and how running multiple attention mechanisms in parallel gives richer understanding.

⏱️ 10 min 🎯 Lesson 7 of 11
Slide 1 of 5

One Head Isn't Enough

A single attention head can only focus on one type of relationship at a time. But language has many simultaneous relationships:

1. Syntactic: Subject → Verb agreement ("The cats are")
2. Semantic: Word meaning in context ("bank" = river or finance?)
3. Positional: Nearby words tend to be related
4. Coreference: Pronouns → their referents
💡 Solution: Run several attention heads in parallel, each learning to capture a different type of relationship!
Slide 2 of 5

Multiple Perspectives

Each head has its own W_Q, W_K, W_V matrices — so each head learns to attend to different things:

Head 1: Subject-Verb relationships ("cats" → "are")
Head 2: Noun-Adjective relationships ("big" → "dog")
Head 3: Coreference ("it" → "the cat")
Head 4: Positional proximity (adjacent words)

Original Transformer: h = 8 heads
GPT-3: h = 96 heads
Slide 3 of 5

The MHA Formula

Multi-Head Attention concatenates the output of all heads, then projects:

MultiHead(Q,K,V) = Concat(head₁, ..., headₕ) · W_O

where each head is:

headᵢ = Attention(Q·W_Qᵢ, K·W_Kᵢ, V·W_Vᵢ)
1. Each head runs attention with its own learned W_Q, W_K, W_V
2. Each head produces an output vector of size d_v (e.g., 64)
3. All h outputs are concatenated → h × d_v dimensions (e.g., 8 × 64 = 512)
4. W_O projects back to d_model dimensions (e.g., 512)
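The four steps above can be sketched in plain NumPy (a minimal, illustrative sketch, not a production implementation; the dimensions match the original Transformer, and all variable names are our own):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """X: (seq_len, d_model); W_Q/W_K/W_V: lists of h matrices (d_model, d_k);
    W_O: (h * d_v, d_model). Returns (seq_len, d_model)."""
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv          # each (seq_len, d_k)
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)           # scaled dot-product
        A = softmax(scores, axis=-1)              # attention weights per row
        heads.append(A @ V)                       # (seq_len, d_v)
    # Step 3: concatenate all heads; step 4: project back with W_O.
    return np.concatenate(heads, axis=-1) @ W_O

# Toy example: d_model = 512, h = 8, d_k = d_v = 64, sequence of 10 tokens.
rng = np.random.default_rng(0)
d_model, h, d_k = 512, 8, 64
X = rng.standard_normal((10, d_model))
scale = 1 / np.sqrt(d_model)                      # keep activations small
W_Q = [rng.standard_normal((d_model, d_k)) * scale for _ in range(h)]
W_K = [rng.standard_normal((d_model, d_k)) * scale for _ in range(h)]
W_V = [rng.standard_normal((d_model, d_k)) * scale for _ in range(h)]
W_O = rng.standard_normal((h * d_k, d_model)) * scale
out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (10, 512)
```

Note that the input and output shapes are identical, which is what lets Transformer blocks stack on top of each other.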
Slide 4 of 5

Interactive: Head Visualization

An interactive visualization shows 4 attention heads running in parallel on the same sentence.

Each head processes the same input but through different learned projections, capturing different patterns.

Slide 5 of 5

Why This is Powerful

Multi-Head Attention gives the model the ability to simultaneously understand:

Grammar
Semantics
Coreference
Proximity
Discourse

...all from the same layer, in one forward pass.

✅ The model isn't told which head should learn what — it discovers the most useful division of labor during training!
Compute efficiency: Each head works on a smaller d_k (d_model/h), so total compute is comparable to one full-size head — but with h heads free to specialize in different patterns!
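This compute parity is easy to verify with a little arithmetic (an illustrative count of the Q/K/V projection parameters only, using the original Transformer's sizes):

```python
# One full-size head vs. h smaller heads: same Q/K/V projection cost.
d_model, h = 512, 8
d_k = d_model // h  # 64 per head

# Single head with d_k = d_model: three (d_model x d_model) projections.
one_big_head = 3 * d_model * d_model
# h heads, each with three (d_model x d_k) projections where d_k = d_model / h.
many_small_heads = h * 3 * d_model * d_k

print(one_big_head, many_small_heads)  # 786432 786432 — identical
```

The h in the per-head count cancels the 1/h shrink in d_k, so splitting into heads adds diversity essentially for free.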
🎭

The Expert Panel Analogy

Imagine a committee of experts reviewing a document. The grammar expert focuses on sentence structure, the content expert on meaning, the fact-checker on references, and the rhetorician on tone. Multi-Head Attention is like this — each head is a different expert examining the text, and their conclusions are combined for a complete picture.

Key Concepts

🔀
Multi-Head Attention
Running h attention heads in parallel, each with different learned projections.
🔢
h (number of heads)
Original Transformer: h=8. Modern models: up to h=96 or more.
📎
Concatenation
Head outputs are concatenated, then projected by W_O back to d_model size.
📐
d_k = d_model / h
Each head works in a lower dimension (d_model/h), keeping total compute roughly constant.

Quick Check

Why does Multi-Head Attention use multiple heads instead of one big attention head?

A. To increase the vocabulary size
B. To process multiple sentences at once
C. Each head can learn to capture different types of relationships simultaneously
D. To reduce memory usage