Module 3 · Architecture

The Decoder

The decoder generates output one token at a time. Learn about masked attention, cross-attention, and how text is generated.

⏱️ 12 min 🎯 Lesson 10 of 11
Slide 1 of 7

The Decoder's Job

The decoder generates output token by token, using the encoder's contextual representation and the tokens it has already generated.

Encoder Output
(full input context)
+
Generated So Far
"Je suis"
DECODER
Next Token
"étudiant"

It's auto-regressive: the decoder generates one token, feeds it back in, generates the next, and so on until the sequence is complete.

Slide 2 of 7

Three Sub-Layers

Unlike the encoder, which has two sub-layers, each decoder layer has three:

1
Masked Self-Attention
Attends to previously generated tokens only (can't peek at future tokens)
2
Cross-Attention (Encoder-Decoder Attention)
Attends to the encoder's output — this is how the decoder "reads" the input
3
Feed-Forward Network
Same as encoder — processes each position independently

All three sub-layers are wrapped with a residual connection and layer normalization (Add & Norm).
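The wiring of the three sub-layers can be sketched as follows. This is a minimal illustration of the data flow only: the attention and feed-forward sub-layers are passed in as identity stand-ins, and the layer norm has no learned scale/shift parameters, unlike a real implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean, unit variance
    # (no learned gain/bias here, for simplicity).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def decoder_layer(x, enc_out, masked_self_attn, cross_attn, ffn):
    # 1. Masked self-attention over the decoder's own tokens
    x = layer_norm(x + masked_self_attn(x))
    # 2. Cross-attention: queries from x, keys/values from encoder output
    x = layer_norm(x + cross_attn(x, enc_out))
    # 3. Position-wise feed-forward network
    x = layer_norm(x + ffn(x))
    return x

# Identity stand-ins just to show the wiring; real sub-layers are learned.
rng = np.random.default_rng(0)
out = decoder_layer(
    rng.normal(size=(2, 4)),        # 2 decoder positions, d_model = 4
    rng.normal(size=(3, 4)),        # 3 encoder positions
    masked_self_attn=lambda x: x,
    cross_attn=lambda x, e: x,
    ffn=lambda x: x,
)
print(out.shape)  # (2, 4): same shape in, same shape out
```

Note that each sub-layer preserves the shape of its input, which is what lets residual connections (`x + sublayer(x)`) work and lets decoder layers stack.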

Slide 3 of 7

Masked Self-Attention

During training, the decoder processes all output tokens in parallel (for efficiency). But we must prevent each token from "cheating" by looking at future tokens.

🔒 Causal Masking: When computing attention for token at position i, we mask out all positions j > i (set them to -∞ before softmax).

Generating "I love AI":

I → can only see: I
love → can only see: I, love
AI → can only see: I, love, AI
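The mask described above can be built in a few lines. The sketch below (using numpy) adds -∞ to every position j > i, so those entries become exactly 0 after the softmax:

```python
import numpy as np

def causal_mask(seq_len):
    # -inf above the diagonal (future positions), 0 elsewhere.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Uniform scores for the 3 tokens of "I love AI", then apply the mask.
weights = softmax(np.zeros((3, 3)) + causal_mask(3))
print(np.round(weights, 2))
# Row 0 attends only to position 0; row 1 to positions 0-1; row 2 to all.
```

Because exp(-∞) = 0, masked positions receive zero attention weight, and each row still sums to 1 over the visible positions.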
Slide 4 of 7

Cross-Attention

This is where the magic of the Encoder-Decoder architecture happens. In cross-attention:

Q
Queries come from the decoder (what the decoder is currently generating)
K
Keys come from the encoder output (the encoded input)
V
Values come from the encoder output (the encoded input)
🗣️

Translation analogy

The decoder asks (Q): "What French word should I generate next?" It searches the encoded English (K) to find relevant context, and retrieves the information (V) from those encoder positions.

Slide 5 of 7

Auto-Regressive Generation

At inference time, the decoder generates one token at a time:

1
Start with <BOS> (beginning of sequence token)
2
Run decoder → get probability distribution over vocabulary
3
Sample or take argmax → choose next token (e.g., "The")
4
Append "The" to the sequence → run decoder again
5
Repeat until <EOS> (end of sequence) is generated
💡 This is why LLMs generate text token by token: each new token depends on all previous tokens, so inference is fundamentally sequential.
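The five steps above can be sketched as a loop. The "decoder" here is a hard-coded toy function over a five-word vocabulary, standing in for a real model that would return next-token logits:

```python
import numpy as np

# Toy vocabulary; the stand-in "decoder" maps the sequence so far to
# next-token logits. A real model would run the full decoder stack.
vocab = ["<BOS>", "The", "cat", "sat", "<EOS>"]

def toy_decoder(token_ids):
    # Hard-coded rule: strongly prefer the next token in vocab order.
    logits = np.full(len(vocab), -10.0)
    logits[min(token_ids[-1] + 1, len(vocab) - 1)] = 10.0
    return logits

seq = [0]                                    # 1. start with <BOS>
while vocab[seq[-1]] != "<EOS>" and len(seq) < 10:
    logits = toy_decoder(seq)                # 2. run decoder on everything so far
    next_id = int(np.argmax(logits))         # 3. greedy: take the argmax
    seq.append(next_id)                      # 4. append and run again
                                             # 5. loop ends at <EOS>
print(" ".join(vocab[i] for i in seq))       # <BOS> The cat sat <EOS>
```

Swapping the `argmax` for sampling from `softmax(logits)` gives stochastic generation; the loop structure is identical.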
Slide 6 of 7

The Output Layer

After the final decoder layer, two operations convert the representation to a word:

Decoder output vector
[d_model dimensions]
↓ Linear projection
Logits vector
[vocab_size dimensions — one per token!]
↓ Softmax
Probability distribution
[sum = 1.0]
↓ Sample / argmax
Next token: "hello" (ID: 31373)
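The projection-then-softmax pipeline above is two matrix operations. The sketch below uses toy sizes (real models use d_model in the hundreds and vocabularies of tens of thousands) and a random, untrained projection matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab_size = 16, 100        # toy sizes for illustration

h = rng.normal(size=d_model)         # final decoder output for one position
W = rng.normal(size=(d_model, vocab_size))

logits = h @ W                       # linear projection: d_model -> vocab_size
probs = np.exp(logits - logits.max())
probs /= probs.sum()                 # softmax: probabilities summing to 1.0

next_token = int(np.argmax(probs))   # greedy choice of next token ID
print(probs.shape, round(probs.sum(), 6))  # (100,) 1.0
```

Subtracting `logits.max()` before exponentiating is the standard numerical-stability trick; it leaves the softmax output unchanged.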
Slide 7 of 7

Interactive: Decoder Architecture

Note the three sub-layers: Masked Self-Attention, Cross-Attention (from encoder), and FFN. Data flows upward.

✍️

Analogy: An Author Writing a Translation

Imagine translating a book. You (the decoder) write one word at a time. As you write each word, you look back at what you've already written (masked self-attention), consult the original text (cross-attention to encoder), and then write the next word. This is exactly what the decoder does.

Key Concepts

🔒
Masked Self-Attention
Prevents decoder from seeing future tokens — only attends to past positions.
🌉
Cross-Attention
Q from decoder, K and V from encoder — how decoder reads the input context.
🔁
Auto-Regressive
Generating one token at a time, each conditioned on all previous tokens.
📊
Softmax Output
Final layer produces probabilities over the entire vocabulary for the next token.

Quick Check

In cross-attention, where do the Queries, Keys, and Values come from?

A
All three come from the encoder output
B
Q comes from the decoder, K and V come from the encoder output
C
All three come from the decoder's previous layer
D
Q from the encoder, K and V from the decoder