Module 3 · Architecture

The Encoder

The encoder reads and deeply understands the input. Learn how all the pieces combine into a single powerful layer, stacked 6 times.

⏱️ 10 min 🎯 Lesson 9 of 11
Slide 1 of 6

The Encoder's Job

The encoder reads the input and produces a rich contextual representation for each token. This output is then used by the decoder to generate a response.

Raw text ("Hello world") → Tokens ([1234, 5678]) → Embeddings ([vectors]) → ENCODER (×6 layers) → Context ([enriched vectors])

Think of the encoder as reading a book very carefully, building a deep understanding before answering any questions about it.
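The pipeline above can be sketched in a few lines of numpy. The token IDs and the embedding table below are made up purely for illustration; the point is the shapes: one d_model-wide vector per token goes in, and the encoder returns the same shape, enriched with context.

```python
import numpy as np

d_model = 512                            # embedding width in the original Transformer
vocab = np.random.randn(10000, d_model)  # toy embedding table (illustrative values)

tokens = [1234, 5678]                    # "Hello world" after tokenization (made-up IDs)
x = vocab[tokens]                        # look up one vector per token
print(x.shape)                           # (2, 512)
# The encoder transforms x but preserves this shape:
# its output is a context-enriched (2, 512) array.
```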

Slide 2 of 6

Encoder Layer Components

Each encoder layer has just two sub-layers:

1. Multi-Head Self-Attention: lets each token attend to all other tokens in the input
2. Feed-Forward Network (FFN): two linear layers with a ReLU activation in between, applied to each position independently

Both sub-layers are followed by Add & Layer Norm.

Slide 3 of 6

Residual Connections & Layer Norm

After each sub-layer, two important things happen:

Residual Connection (Add): the sub-layer's output is added to its input:

output = LayerNorm(x + Sublayer(x))

This lets gradients flow easily during training and prevents vanishing gradients.

Layer Normalization (Norm): normalizes each position's vector to mean 0 and variance 1. This stabilizes training and speeds convergence.

💡 Residual connections are like "shortcuts": the input bypasses the sub-layer and gets added back in. This makes very deep networks trainable!
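A minimal numpy sketch of this step. The "sub-layer output" here is just random data standing in for attention or FFN output; the point is that after Add & Layer Norm, every position's vector has mean ≈ 0 and variance ≈ 1.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to mean 0, variance 1
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.randn(2, 512)               # input to the sub-layer
sublayer_out = np.random.randn(2, 512)    # stand-in for Sublayer(x)
y = layer_norm(x + sublayer_out)          # output = LayerNorm(x + Sublayer(x))

print(y.mean(axis=-1))                    # ≈ 0 for each position
print(y.var(axis=-1))                     # ≈ 1 for each position
```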
Slide 4 of 6

The Feed-Forward Network

The FFN is applied independently to each position: the same network, with the same weights, is applied to every token, and all positions are processed in parallel.

FFN(x) = ReLU(x·W₁ + b₁)·W₂ + b₂
1. Linear layer: d_model → d_ff (expand), e.g. 512 → 2048
2. ReLU activation (non-linearity)
3. Linear layer: d_ff → d_model (compress), e.g. 2048 → 512
Why expand?

The expansion gives the network more capacity to compute complex transformations — like a "thinking" step where the model can explore ideas in a higher-dimensional space.
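The three steps above can be sketched directly in numpy. The weights below are randomly initialized (untrained) and scaled arbitrarily; a real model learns them. Note the shape goes 512 → 2048 → 512, so the output matches the input shape.

```python
import numpy as np

d_model, d_ff = 512, 2048                    # dimensions from the original Transformer

# Untrained, randomly initialized weights (for illustration only)
W1 = np.random.randn(d_model, d_ff) * 0.02
b1 = np.zeros(d_ff)
W2 = np.random.randn(d_ff, d_model) * 0.02
b2 = np.zeros(d_model)

def ffn(x):
    h = np.maximum(0, x @ W1 + b1)           # expand 512 -> 2048, then ReLU
    return h @ W2 + b2                       # compress 2048 -> 512

x = np.random.randn(2, d_model)              # two token positions
print(ffn(x).shape)                          # (2, 512): same shape in, same shape out
```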

Slide 5 of 6

Stacking 6 Layers

The original Transformer stacks 6 identical encoder layers. Each layer refines the representation further:

Layer 1 — Learns basic patterns (word → nearby word)
Layer 2 — More complex syntactic relationships
Layer 3 — Semantic groupings (entities)
Layer 4 — Complex cross-sentence patterns
Layer 5 — Abstract relationships
Layer 6 — Rich contextual representation

Modern LLMs stack many more: GPT-3 uses 96 layers!
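Stacking is just a loop: each layer applies its two sub-layers (each followed by Add & Layer Norm) and passes the result to the next layer. In the sketch below the sub-layers are identity functions, just to show the wiring; real layers have learned attention and FFN weights.

```python
import numpy as np

d_model = 512

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to mean 0, variance 1
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def encoder_layer(x, self_attn, ffn):
    x = layer_norm(x + self_attn(x))   # sub-layer 1: self-attention + Add & Norm
    x = layer_norm(x + ffn(x))         # sub-layer 2: feed-forward + Add & Norm
    return x

identity = lambda x: x                 # stand-in sub-layers (illustration only)

x = np.random.randn(2, d_model)        # two token positions
for _ in range(6):                     # stack 6 identical layers
    x = encoder_layer(x, identity, identity)
print(x.shape)                         # (2, 512): the shape never changes
```

Because every layer maps (seq_len, d_model) to (seq_len, d_model), layers can be stacked to any depth without changing the rest of the architecture.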

Slide 6 of 6

Interactive: Encoder Architecture

Here's the complete encoder layer — see how all components fit together:

The data flows upward through the layer. Both sub-layers use Add & Layer Norm.

📚

The Scholar Reading Analogy

Each encoder layer is like a pass through a text by an increasingly expert scholar. The first pass notes individual words. The second pass spots phrases and grammar. Later passes uncover abstract themes, historical context, and deep meaning. Each reading enriches the understanding — that's what 6 (or 96) layers do!

Key Concepts

📥
Encoder
Reads and encodes the input into rich contextual representations.
🔗
Residual Connection
x + Sublayer(x) — bypasses each sub-layer to aid gradient flow in deep networks.
📊
Layer Normalization
Normalizes activations to zero mean and unit variance — stabilizes training.
🧠
FFN
Feed-Forward Network — same MLP applied to each position independently.

Quick Check

What are the two main sub-layers inside each encoder layer?

A
Embedding and Positional Encoding
B
Cross-Attention and Masked Attention
C
Multi-Head Self-Attention and Feed-Forward Network
D
LSTM and GRU