Module 2 · Attention

The Sequence Problem

Why did we need a new architecture? Understanding the limitations of sequential models and why attention was the solution.

⏱️ 9 min 🎯 Lesson 4 of 11
Slide 1 of 6

How RNNs Worked

Before Transformers, most language models were Recurrent Neural Networks (RNNs). They processed text the way we read: one word at a time, left to right, building up a "memory" as they went.

The
Memory: "The"
cat
Memory: "The cat"
sat
Memory fading...
...
Memory: ???

The problem: the "memory" (called a hidden state) is just a fixed-size vector. Squeezing an entire long document into one small vector loses information.
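To make the fixed-size bottleneck concrete, here is a minimal sketch of an RNN step in NumPy. The sizes (an 8-number memory, a 100-word vocabulary) and the word IDs are illustrative choices, not values from the lesson:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, vocab_size = 8, 100  # illustrative toy sizes

# Every word gets squeezed into the SAME fixed-size hidden vector.
W_h = rng.normal(0, 0.1, (hidden_size, hidden_size))  # hidden-to-hidden weights
W_x = rng.normal(0, 0.1, (hidden_size, vocab_size))   # input-to-hidden weights

def rnn_step(h, x_onehot):
    """One step: new memory = f(old memory, current word)."""
    return np.tanh(W_h @ h + W_x @ x_onehot)

h = np.zeros(hidden_size)            # memory starts empty
for word_id in [1, 42, 7, 99]:       # stand-ins for "The cat sat ..."
    x = np.zeros(vocab_size)
    x[word_id] = 1.0                 # one-hot encode the current word
    h = rnn_step(h, x)

# However long the sentence gets, the memory is still just 8 numbers.
print(h.shape)  # → (8,)
```

The loop makes the problem visible: a 4-word sentence and a 4,000-word document both end up compressed into the same 8-dimensional vector.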

Slide 2 of 6

The Vanishing Gradient Problem

To learn, neural networks use backpropagation — sending error signals backwards through the network. In RNNs, this signal must travel through every time step.

❌ The further back in the sequence, the weaker the gradient signal gets — like a whisper getting quieter down a long hallway. Early words barely influence the learning.

This is the Vanishing Gradient Problem. LSTMs and GRUs helped somewhat, but didn't fully solve it for very long sequences.

word₁
word₂
word₃
word₄

← Error signal gets weaker the further back it goes
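The "whisper down a hallway" effect is just geometric decay. A tiny sketch, assuming a hypothetical per-step gradient scale of 0.8 (real Jacobians vary, but the pattern is the same whenever the typical scale is below 1):

```python
# During backpropagation through time, the error signal is multiplied
# by a Jacobian at every time step. If its typical scale is below 1,
# the signal shrinks geometrically with distance.
jacobian_scale = 0.8  # hypothetical per-step factor, chosen for illustration

signal = 1.0
for step in range(1, 51):
    signal *= jacobian_scale
    if step in (1, 10, 50):
        print(f"after {step:2d} steps: {signal:.2e}")
```

After 10 steps the signal is about a tenth of its original size; after 50 steps it is around 10⁻⁵, far too small to teach the network anything about the first words of the sequence.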

Slide 3 of 6

The Pronoun Problem

Consider this sentence:

"The trophy didn't fit in the suitcase because it was too big."

What does "it" refer to? The trophy (because the trophy was too big to fit).

"The trophy didn't fit in the suitcase because it was too small."

Now "it" refers to the suitcase!

To resolve "it", the model needs to understand both "trophy" and "suitcase" at the same time — not just remember "trophy" from a fading memory. RNNs struggle with this.

Slide 4 of 6

Another Problem: No Parallelism

RNNs must process words one at a time in sequence. You can't process word 5 until you've finished words 1–4.

❌ RNN: Sequential

Step 1: Process "The"
Step 2: Process "cat"
Step 3: Process "sat"
Step 4: Process "on"

Can't use GPU parallelism!

✅ Transformer: Parallel

Process "The"
Process "cat"
Process "sat"
Process "on"

All at once! 🚀

Full GPU utilization
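The difference shows up directly in code. A sketch in NumPy, with illustrative sizes: the RNN-style loop has a data dependency between steps, while the Transformer-style layer applies the same projection to every token in one matrix multiply, which is exactly what GPUs accelerate:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 16                  # illustrative toy sizes
X = rng.normal(size=(seq_len, d_model))   # 4 token embeddings
W = rng.normal(size=(d_model, d_model))   # a shared projection

# RNN-style: step t cannot start until step t-1 has finished.
outputs = []
h = np.zeros(d_model)
for t in range(seq_len):
    h = np.tanh(X[t] @ W + h)   # depends on the previous hidden state
    outputs.append(h)
sequential = np.stack(outputs)

# Transformer-style: every token is transformed at once — one big
# matrix multiply with no step-to-step dependency.
parallel = np.tanh(X @ W)

print(sequential.shape, parallel.shape)  # → (4, 16) (4, 16)
```

Both produce one vector per token, but only the second formulation lets the hardware work on all four tokens simultaneously.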

Slide 5 of 6

The RNN vs Transformer Comparison

Feature         | RNN/LSTM      | Transformer
Processing      | Sequential ❌ | Parallel ✅
Long-range deps | Struggles ❌  | Handles well ✅
Training speed  | Slow ❌       | Fast ✅
Context window  | Limited ❌    | Configurable ✅
GPU usage       | Poor ❌       | Excellent ✅
Slide 6 of 6

The Solution: Attention

The revolutionary insight: instead of compressing everything into one memory vector, let every word directly attend to every other word.

"The trophy didn't fit because it was too big."

The
trophy
didn't
fit
because
it
was
too
big

⚡ "it" directly attends to "trophy" — no memory required!

✅ With attention, information doesn't have to travel through time steps — any word can directly communicate with any other word!
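The "every word attends to every other word" idea can be sketched in a few lines of NumPy. This is a minimal self-attention sketch with random toy embeddings, not a trained model, so the weights it prints are arbitrary; what matters is the shape of the computation:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: every position looks at every other."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # all-pairs similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d = 9, 4                      # 9 words: "The trophy ... big"
X = rng.normal(size=(seq_len, d))      # random stand-in embeddings
out, weights = attention(X, X, X)      # self-attention: Q = K = V = X

# Row 5 ("it") holds one weight per word — including "trophy" at index 1.
# The connection is direct: no hidden state carries it across time steps.
print(weights[5].round(2))
```

Each row of `weights` sums to 1 and spans the whole sentence, so "it" reaches "trophy" in a single step instead of through seven fading memory updates.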
📞

The Phone Tree vs Group Chat Analogy

RNNs are like a phone tree: Person A calls B, who calls C, who calls D. By the time D hears the message, it's been passed through multiple people and details are lost. Transformers are like a group chat — everyone hears everyone else directly, in parallel.

Key Concepts

🔄
RNN
Recurrent Neural Network — processes sequences one step at a time, left to right.
📉
Vanishing Gradient
Error signals weaken as they travel back through many time steps during training.
⚡
Parallelism
Transformers process all tokens simultaneously, fully utilizing modern GPUs.
🎯
Long-range Dependencies
The ability to connect information across long distances in text — attention handles this directly, since any token can reach any other in one step.

Quick Check

What is one major reason Transformers train faster than RNNs?

A
Transformers process all tokens in parallel, making full use of GPUs
B
Transformers have fewer parameters
C
Transformers skip punctuation tokens
D
Transformers use smaller vocabularies