Module 3 · Architecture

The Decoder

The decoder generates output one token at a time. Learn about masked attention, cross-attention, and how text is generated.

⏱️ 12 min 🎯 Lesson 10 of 11
Slide 1 of 7

The Decoder's Job

The decoder generates output token by token, using the encoder's contextual representation and the tokens it has already generated.

Encoder Output
(full input context)
+
Generated So Far
"Je suis"
DECODER
Next Token
"étudiant"

It's auto-regressive: the decoder generates one token, feeds it back in, generates the next, and so on until the sequence is complete.

Slide 2 of 7

Three Sub-Layers

Unlike the encoder, which has two sub-layers, each decoder layer has three:

1
Masked Self-Attention
Attends to previously generated tokens only (can't peek at future tokens)
2
Cross-Attention (Encoder-Decoder Attention)
Attends to the encoder's output — this is how the decoder "reads" the input
3
Feed-Forward Network
Same as encoder — processes each position independently

All three sub-layers are wrapped with a residual connection and layer normalization (Add & Norm).
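The wiring of the three sub-layers can be sketched as follows. This is a minimal illustration of the data flow only: the attention and feed-forward sub-layers are passed in as identity stand-ins, and the layer norm has no learned scale/shift parameters, unlike a real implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean, unit variance
    # (no learned gain/bias here, for simplicity).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def decoder_layer(x, enc_out, masked_self_attn, cross_attn, ffn):
    # 1. Masked self-attention over the decoder's own tokens
    x = layer_norm(x + masked_self_attn(x))
    # 2. Cross-attention: queries from x, keys/values from encoder output
    x = layer_norm(x + cross_attn(x, enc_out))
    # 3. Position-wise feed-forward network
    x = layer_norm(x + ffn(x))
    return x

# Identity stand-ins just to show the wiring; real sub-layers are learned.
rng = np.random.default_rng(0)
out = decoder_layer(
    rng.normal(size=(2, 4)),        # 2 decoder positions, d_model = 4
    rng.normal(size=(3, 4)),        # 3 encoder positions
    masked_self_attn=lambda x: x,
    cross_attn=lambda x, e: x,
    ffn=lambda x: x,
)
print(out.shape)  # (2, 4): same shape in, same shape out
```

Note that each sub-layer preserves the shape of its input, which is what lets residual connections (`x + sublayer(x)`) work and lets decoder layers stack.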

Slide 3 of 7

Masked Self-Attention

During training, the decoder processes all output tokens in parallel (for efficiency). But we must prevent each token from "cheating" by looking at future tokens.

🔒 Causal Masking: When computing attention for token at position i, we mask out all positions j > i (set them to -∞ before softmax).

Generating "I love AI":

I → can only see: I
love → can only see: I, love
AI → can only see: I, love, AI
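The mask described above can be built in a few lines. The sketch below (using numpy) adds -∞ to every position j > i, so those entries become exactly 0 after the softmax:

```python
import numpy as np

def causal_mask(seq_len):
    # -inf above the diagonal (future positions), 0 elsewhere.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Uniform scores for the 3 tokens of "I love AI", then apply the mask.
weights = softmax(np.zeros((3, 3)) + causal_mask(3))
print(np.round(weights, 2))
# Row 0 attends only to position 0; row 1 to positions 0-1; row 2 to all.
```

Because exp(-∞) = 0, masked positions receive zero attention weight, and each row still sums to 1 over the visible positions.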
Slide 4 of 7

Cross-Attention

This is where the magic of the Encoder-Decoder architecture happens. In cross-attention:

Q
Queries come from the decoder (what the decoder is currently generating)
K
Keys come from the encoder output (the encoded input)
V
Values come from the encoder output (the encoded input)
🗣️

Translation analogy

The decoder asks (Q): "What French word should I generate next?" It searches the encoded English (K) to find relevant context, and retrieves the information (V) from those encoder positions.

Slide 5 of 7

Auto-Regressive Generation

At inference time, the decoder generates one token at a time:

1
Start with <BOS> (beginning of sequence token)
2
Run decoder → get probability distribution over vocabulary
3
Sample or take argmax → choose next token (e.g., "The")
4
Append "The" to the sequence → run decoder again
5
Repeat until <EOS> (end of sequence) is generated
💡 This is why LLMs generate text token by token: each new token depends on all previous tokens, so inference is fundamentally sequential.
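The five steps above can be sketched as a loop. The "decoder" here is a hard-coded toy function over a five-word vocabulary, standing in for a real model that would return next-token logits:

```python
import numpy as np

# Toy vocabulary; the stand-in "decoder" maps the sequence so far to
# next-token logits. A real model would run the full decoder stack.
vocab = ["<BOS>", "The", "cat", "sat", "<EOS>"]

def toy_decoder(token_ids):
    # Hard-coded rule: strongly prefer the next token in vocab order.
    logits = np.full(len(vocab), -10.0)
    logits[min(token_ids[-1] + 1, len(vocab) - 1)] = 10.0
    return logits

seq = [0]                                    # 1. start with <BOS>
while vocab[seq[-1]] != "<EOS>" and len(seq) < 10:
    logits = toy_decoder(seq)                # 2. run decoder on everything so far
    next_id = int(np.argmax(logits))         # 3. greedy: take the argmax
    seq.append(next_id)                      # 4. append and run again
                                             # 5. loop ends at <EOS>
print(" ".join(vocab[i] for i in seq))       # <BOS> The cat sat <EOS>
```

Swapping the `argmax` for sampling from `softmax(logits)` gives stochastic generation; the loop structure is identical.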
Slide 6 of 7

The Output Layer

After the final decoder layer, two operations convert the representation to a word:

Decoder output vector
[d_model dimensions]
↓ Linear projection
Logits vector
[vocab_size dimensions — one per token!]
↓ Softmax
Probability distribution
[sum = 1.0]
↓ Sample / argmax
Next token: "hello" (ID: 31373)
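The projection-then-softmax pipeline above is two matrix operations. The sketch below uses toy sizes (real models use d_model in the hundreds and vocabularies of tens of thousands) and a random, untrained projection matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab_size = 16, 100        # toy sizes for illustration

h = rng.normal(size=d_model)         # final decoder output for one position
W = rng.normal(size=(d_model, vocab_size))

logits = h @ W                       # linear projection: d_model -> vocab_size
probs = np.exp(logits - logits.max())
probs /= probs.sum()                 # softmax: probabilities summing to 1.0

next_token = int(np.argmax(probs))   # greedy choice of next token ID
print(probs.shape, round(probs.sum(), 6))  # (100,) 1.0
```

Subtracting `logits.max()` before exponentiating is the standard numerical-stability trick; it leaves the softmax output unchanged.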
Slide 7 of 7

Interactive: Decoder Architecture

Note the three sub-layers: Masked Self-Attention, Cross-Attention (from encoder), and FFN. Data flows upward.

✍️

Analogy: An Author Writing a Translation

Imagine translating a book. You (the decoder) write one word at a time. As you write each word, you look back at what you've already written (masked self-attention), consult the original text (cross-attention to encoder), and then write the next word. This is exactly what the decoder does.

Key Concepts

🔒
Masked Self-Attention
Prevents decoder from seeing future tokens — only attends to past positions.
🌉
Cross-Attention
Q from decoder, K and V from encoder — how decoder reads the input context.
🔁
Auto-Regressive
Generating one token at a time, each conditioned on all previous tokens.
📊
Softmax Output
Final layer produces probabilities over the entire vocabulary for the next token.

Quick Check

In cross-attention, where do the Queries, Keys, and Values come from?

A
All three come from the encoder output
B
Q comes from the decoder, K and V come from the encoder output
C
All three come from the decoder's previous layer
D
Q from the encoder, K and V from the decoder