Slide 1 of 7
The Decoder's Job
The decoder generates output token by token, using the encoder's contextual representation and the tokens it has already generated.
Encoder Output
(full input context)
+
Generated So Far
"Je suis"
→
DECODER
→
Next Token
"étudiant"
It's auto-regressive: the decoder generates one token, feeds it back in, generates the next, and so on until the sequence is complete.
Slide 2 of 7
Three Sub-Layers
Unlike the encoder (2 sub-layers), each decoder layer has three sub-layers:
1
Masked Self-Attention
Attends to previously generated tokens only (can't peek at future tokens)
2
Cross-Attention (Encoder-Decoder Attention)
Attends to the encoder's output — this is how the decoder "reads" the input
3
Feed-Forward Network
Same as encoder — processes each position independently
All three sub-layers are wrapped with a residual connection and layer normalization (Add & Norm).
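The three sub-layers compose as shown below. This is a minimal sketch, assuming post-layer-norm ordering and taking the attention and feed-forward blocks as opaque functions (the real sub-layers are learned multi-head attention and an MLP):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean, unit variance.
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def decoder_layer(x, enc_out, masked_self_attn, cross_attn, ffn):
    # Sub-layer 1: masked self-attention over the decoder's own tokens.
    x = layer_norm(x + masked_self_attn(x))
    # Sub-layer 2: cross-attention — queries from x, keys/values from enc_out.
    x = layer_norm(x + cross_attn(x, enc_out))
    # Sub-layer 3: position-wise feed-forward network, same as the encoder's.
    x = layer_norm(x + ffn(x))
    return x
```

Each `x + sublayer(x)` is the residual ("Add") step; `layer_norm` is the "Norm" step that follows it.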
Slide 3 of 7
Masked Self-Attention
During training, the decoder processes all output tokens in parallel (for efficiency). But we must prevent each token from "cheating" by looking at future tokens.
🔒 Causal Masking: When computing attention for token at position i, we mask out all positions j > i (set them to -∞ before softmax).
Generating "I love AI":
"I" → can only see: "I"
"love" → can only see: "I", "love"
"AI" → can only see: "I", "love", "AI"
Slide 4 of 7
Cross-Attention
This is where the magic of the Encoder-Decoder architecture happens. In cross-attention:
Q
Queries come from the decoder (what the decoder is currently generating)
K
Keys come from the encoder output (the encoded input)
V
Values come from the encoder output (the encoded input)
🗣️
Translation analogy
The decoder asks (Q): "What French word should I generate next?" It searches the encoded English (K) to find relevant context, and retrieves the information (V) from those encoder positions.
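The Q/K/V split above can be made concrete. A minimal sketch of single-head cross-attention, assuming plain scaled dot-product attention and caller-supplied projection matrices `W_q`, `W_k`, `W_v`:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def cross_attention(dec_x, enc_out, W_q, W_k, W_v):
    Q = dec_x @ W_q     # queries: from the decoder's current states
    K = enc_out @ W_k   # keys:    from the encoder output
    V = enc_out @ W_v   # values:  from the encoder output
    d_k = Q.shape[-1]
    # No causal mask here — every decoder position may read the full input.
    attn = softmax(Q @ K.T / np.sqrt(d_k))
    return attn @ V
```

Note the output has one row per *decoder* position, even though the keys and values range over the *encoder* positions: the decoder keeps its own sequence length while reading the input.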
Slide 5 of 7
Auto-Regressive Generation
At inference time, the decoder generates one token at a time:
1
Start with <BOS> (beginning of sequence token)
2
Run decoder → get probability distribution over vocabulary
3
Sample or take argmax → choose next token (e.g., "The")
4
Append "The" to the sequence → run decoder again
5
Repeat until <EOS> (end of sequence) is generated
💡 This is why LLMs generate text token by token! Each token depends on all previous tokens. It's fundamentally sequential at inference time.
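The five steps above are a short loop. A minimal sketch using greedy (argmax) decoding; `decoder_step` is a hypothetical stand-in for a full decoder that maps the tokens generated so far to a logits vector over the vocabulary, and the `BOS`/`EOS` IDs are assumptions:

```python
import numpy as np

BOS, EOS = 0, 1  # assumed special-token IDs

def generate(decoder_step, max_len=20):
    tokens = [BOS]                          # 1. start with <BOS>
    for _ in range(max_len):
        logits = decoder_step(tokens)       # 2. run decoder -> logits
        next_tok = int(np.argmax(logits))   # 3. greedy: pick the top token
        tokens.append(next_tok)             # 4. append and run again
        if next_tok == EOS:                 # 5. stop at <EOS>
            break
    return tokens
```

Swapping the `argmax` for sampling from `softmax(logits)` (possibly with a temperature) gives the stochastic decoding most LLMs use in practice.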
Slide 6 of 7
The Output Layer
After the final decoder layer, two operations convert the representation to a word:
Decoder output vector
[d_model dimensions]
↓ Linear projection
Logits vector
[vocab_size dimensions — one per token!]
↓ Softmax
Probability distribution
[sum = 1.0]
↓ Sample / argmax
Next token: "hello" (ID: 31373)
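The linear → softmax → sample pipeline is a few lines. A minimal sketch, assuming `h` is the final decoder state for the last position and `W_out` is the learned `(d_model, vocab_size)` projection:

```python
import numpy as np

def next_token(h, W_out, rng=None):
    logits = h @ W_out                   # (vocab_size,) — one score per token
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax: probabilities sum to 1.0
    if rng is None:
        return int(np.argmax(probs))     # greedy: highest-probability token
    return int(rng.choice(len(probs), p=probs))  # or sample from the distribution
```

Subtracting `logits.max()` before exponentiating doesn't change the result but keeps the softmax numerically stable for large logits.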
Slide 7 of 7
Interactive: Decoder Architecture
Note the three sub-layers: Masked Self-Attention, Cross-Attention (from encoder), and FFN. Data flows upward.