Slide 1 of 6
The Encoder's Job
The encoder reads the input and produces a rich contextual representation for each token. This output is then used by the decoder to generate a response.
Raw text
"Hello world"
→
Tokens
[1234, 5678]
→
Embeddings
[vectors]
→
ENCODER
×6 layers
→
Context
[enriched vectors]
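The flow above can be sketched as a tiny pipeline. The callables here (`tokenizer`, `embed`, `encoder_layers`) are hypothetical placeholders, not a real library API:

```python
def encode_text(text, tokenizer, embed, encoder_layers):
    """Sketch of the encoder pipeline: text -> token ids -> embeddings
    -> contextual vectors. All three callables are assumed to be supplied."""
    ids = tokenizer(text)          # e.g. "Hello world" -> [1234, 5678]
    vectors = embed(ids)           # one embedding vector per token
    for layer in encoder_layers:   # x6 layers in the original Transformer
        vectors = layer(vectors)   # each layer enriches the representation
    return vectors
```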
Think of the encoder as reading a book very carefully, building a deep understanding before answering any questions about it.
Slide 2 of 6
Encoder Layer Components
Each encoder layer has just two sub-layers:
1
Multi-Head Self-Attention
Lets each token attend to all other tokens in the input
2
Feed-Forward Network (FFN)
Two linear layers with a ReLU activation in between — processes each position independently
Both have
Add & Layer Norm
Applied after each sub-layer
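The two sub-layers can be summarized in a short structural sketch, where `attn`, `ffn`, and `layer_norm` are assumed to be provided (placeholders, not real implementations):

```python
def encoder_layer(x, attn, ffn, layer_norm):
    """Structure of one encoder layer: two sub-layers,
    each wrapped in Add & Layer Norm."""
    x = layer_norm(x + attn(x))  # 1. multi-head self-attention + Add & Norm
    x = layer_norm(x + ffn(x))   # 2. feed-forward network + Add & Norm
    return x
```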
Slide 3 of 6
Residual Connections & Layer Norm
After each sub-layer, two important things happen:
+
Residual Connection (Add): The sub-layer's output is
added to its input
output = LayerNorm(x + Sublayer(x))
This allows gradients to flow easily during training (prevents vanishing gradients)
N
Layer Normalization: Normalizes each token's vector to mean 0 and variance 1 (followed by a learned scale and shift). This stabilizes training and speeds convergence.
💡 Residual connections are like "shortcuts" — the input bypasses the sub-layer and gets added back in. This makes very deep networks trainable!
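A minimal numpy sketch of the Add & Layer Norm step (the learned scale and shift that follow normalization in practice are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to mean 0 and variance 1.
    (A learned scale and shift would follow in a real implementation.)"""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_out):
    """Residual connection then normalization: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer_out)
```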
Slide 4 of 6
The Feed-Forward Network
The FFN is applied independently to each position — the same weights process every token, and all positions are computed in parallel.
FFN(x) = ReLU(x·W₁ + b₁)·W₂ + b₂
1
Linear layer: d_model → d_ff (expand)
e.g., 512 → 2048
2
ReLU activation (non-linearity)
3
Linear layer: d_ff → d_model (compress)
e.g., 2048 → 512
Why expand?
The expansion gives the network more capacity to compute complex transformations — like a "thinking" step where the model can explore ideas in a higher-dimensional space.
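The three steps above, sketched in numpy with the original paper's dimensions (the random weight initialization here is illustrative, not the paper's scheme):

```python
import numpy as np

d_model, d_ff = 512, 2048  # dimensions from the original Transformer

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.02, (d_model, d_ff)); b1 = np.zeros(d_ff)
W2 = rng.normal(0, 0.02, (d_ff, d_model)); b2 = np.zeros(d_model)

def ffn(x):
    """FFN(x) = ReLU(x.W1 + b1).W2 + b2, applied per position."""
    hidden = np.maximum(0, x @ W1 + b1)  # 1-2. expand 512 -> 2048, ReLU
    return hidden @ W2 + b2              # 3.   compress 2048 -> 512
```

Because the matrix multiplies act row by row, each token's output depends only on that token's input.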
Slide 5 of 6
Stacking 6 Layers
The original Transformer stacks 6 identical encoder layers. Roughly speaking, each layer refines the representation further:
Layer 1 — Learns basic patterns (word → nearby word)
Layer 2 — More complex syntactic relationships
Layer 3 — Semantic groupings (entities)
Layer 4 — Complex cross-sentence patterns
Layer 5 — Abstract relationships
Layer 6 — Rich contextual representation
Modern LLMs stack many more — GPT-3, for example, uses 96 layers!
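Stacking is just repeated application of the same layer structure, as in this one-line sketch (each `layer` stands in for a full encoder layer):

```python
def encode(x, layers):
    """Pass the input through a stack of encoder layers (6 in the original)."""
    for layer in layers:
        x = layer(x)  # each layer's output feeds the next
    return x
```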
Slide 6 of 6
Interactive: Encoder Architecture
Here's the complete encoder layer — see how all components fit together:
The data flows upward through the layer. Both sub-layers use Add & Layer Norm.