1. Each head runs attention with its own learned W_Q, W_K, W_V
2. Each head produces an output vector of size d_v (e.g., 64)
3. All h outputs are concatenated → h × d_v dimensions (e.g., 8 × 64 = 512)
4. W_O projects back to d_model dimensions (e.g., 512)
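The steps above can be sketched in plain NumPy. This is a minimal illustration, not an optimized implementation: the sequence length of 10, the random weights, and the function names are assumptions for demonstration; real models learn these projections and batch the heads into single matrix multiplies.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over attention scores.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # X: (seq_len, d_model); W_Q/W_K/W_V: per-head (d_model, d_k) matrices.
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv          # step 1: per-head projections
        d_k = Q.shape[-1]
        scores = softmax(Q @ K.T / np.sqrt(d_k))  # scaled dot-product attention
        heads.append(scores @ V)                  # step 2: one (seq_len, d_v) output per head
    concat = np.concatenate(heads, axis=-1)       # step 3: (seq_len, h * d_v)
    return concat @ W_O                           # step 4: project back to d_model

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_model // h  # 64
X = rng.standard_normal((10, d_model))
W_Q = [rng.standard_normal((d_model, d_k)) * 0.05 for _ in range(h)]
W_K = [rng.standard_normal((d_model, d_k)) * 0.05 for _ in range(h)]
W_V = [rng.standard_normal((d_model, d_k)) * 0.05 for _ in range(h)]
W_O = rng.standard_normal((h * d_k, d_model)) * 0.05

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (10, 512): same shape as the input, as step 4 requires
```

Note that the output shape matches the input shape, which is what lets Transformer layers stack.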
Interactive: Head Visualization

[Interactive visualization: 4 attention heads running in parallel on the same sentence. Each head processes the same input through different learned projections, capturing different patterns.]
Why This Is Powerful

Multi-Head Attention gives the model the ability to simultaneously understand:

- Grammar (syntactic structure)
- Semantics (word and phrase meaning)
- Coreference (which words refer to the same entity)
- Proximity (local, position-based relationships)
- Discourse (how sentences connect across the text)

...all from the same layer, in one forward pass.
✅ The model isn't told which head should learn what — it discovers the most useful division of labor during training!
⚡ Compute efficiency: Each head works on a smaller d_k (d_model/h), so total compute is similar to one full-size head — but with h times more expressive power!
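A quick back-of-envelope check of that efficiency claim, using the original Transformer's numbers (d_model = 512, h = 8); this is a sketch counting only the Q/K/V projection parameters, ignoring W_O and the attention score computation:

```python
d_model, h = 512, 8
d_k = d_model // h  # 64: each head's reduced dimension

# Q, K, V projections for one head: three (d_model x d_k) matrices.
params_per_head = 3 * d_model * d_k

# All h small heads together...
params_all_heads = h * params_per_head

# ...equal one full-size head where d_k = d_model.
params_one_big_head = 3 * d_model * d_model

print(params_all_heads == params_one_big_head)  # True
```

Because h × d_k = d_model, splitting into heads rearranges the same projection budget rather than multiplying it.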
🎭 The Expert Panel Analogy
Imagine a committee of experts reviewing a document. The grammar expert focuses on sentence structure, the content expert on meaning, the fact-checker on references, and the rhetorician on tone. Multi-Head Attention is like this — each head is a different expert examining the text, and their conclusions are combined for a complete picture.
Key Concepts
🔀 Multi-Head Attention: Running h attention heads in parallel, each with different learned projections.

🔢 h (number of heads): Original Transformer: h=8. Modern models: up to h=96 or more.

📎 Concatenation: Head outputs are concatenated, then projected by W_O back to d_model size.

📐 d_k = d_model / h: Each head works in a lower dimension (d_model/h), keeping total compute constant.
Quick Check
Why does Multi-Head Attention use multiple heads instead of one big attention head?
A. To increase the vocabulary size
B. To process multiple sentences at once
C. Each head can learn to capture different types of relationships simultaneously