Module 2 · Attention

The Sequence Problem

Why did we need a new architecture? Understanding the limitations of sequential models and why attention was the solution.

⏱️ 9 min 🎯 Lesson 4 of 11
Slide 1 of 6

How RNNs Worked

Before Transformers, most language models were Recurrent Neural Networks (RNNs). They processed text the way we read: one word at a time, left to right, building up a "memory" as they went.

The
Memory: "The"
cat
Memory: "The cat"
sat
Memory fading...
...
Memory: ???

The problem: the "memory" (called a hidden state) is just a fixed-size vector. Squeezing an entire long document into one small vector loses information.
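To make the fixed-size bottleneck concrete, here is a minimal sketch of an RNN step in NumPy. The sizes (an 8-number memory, a 100-word vocabulary) and the word IDs are illustrative choices, not values from the lesson:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, vocab_size = 8, 100  # illustrative toy sizes

# Every word gets squeezed into the SAME fixed-size hidden vector.
W_h = rng.normal(0, 0.1, (hidden_size, hidden_size))  # hidden-to-hidden weights
W_x = rng.normal(0, 0.1, (hidden_size, vocab_size))   # input-to-hidden weights

def rnn_step(h, x_onehot):
    """One step: new memory = f(old memory, current word)."""
    return np.tanh(W_h @ h + W_x @ x_onehot)

h = np.zeros(hidden_size)            # memory starts empty
for word_id in [1, 42, 7, 99]:       # stand-ins for "The cat sat ..."
    x = np.zeros(vocab_size)
    x[word_id] = 1.0                 # one-hot encode the current word
    h = rnn_step(h, x)

# However long the sentence gets, the memory is still just 8 numbers.
print(h.shape)  # → (8,)
```

The loop makes the problem visible: a 4-word sentence and a 4,000-word document both end up compressed into the same 8-dimensional vector.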

Slide 2 of 6

The Vanishing Gradient Problem

To learn, neural networks use backpropagation — sending error signals backwards through the network. In RNNs, this signal must travel through every time step.

❌ The further back in the sequence, the weaker the gradient signal gets — like a whisper getting quieter down a long hallway. Early words barely influence the learning.

This is the Vanishing Gradient Problem. LSTMs and GRUs helped somewhat, but didn't fully solve it for very long sequences.

word₁
word₂
word₃
word₄

← Error signal gets weaker the further back it goes
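The "whisper down a hallway" effect is just geometric decay. A tiny sketch, assuming a hypothetical per-step gradient scale of 0.8 (real Jacobians vary, but the pattern is the same whenever the typical scale is below 1):

```python
# During backpropagation through time, the error signal is multiplied
# by a Jacobian at every time step. If its typical scale is below 1,
# the signal shrinks geometrically with distance.
jacobian_scale = 0.8  # hypothetical per-step factor, chosen for illustration

signal = 1.0
for step in range(1, 51):
    signal *= jacobian_scale
    if step in (1, 10, 50):
        print(f"after {step:2d} steps: {signal:.2e}")
```

After 10 steps the signal is about a tenth of its original size; after 50 steps it is around 10⁻⁵, far too small to teach the network anything about the first words of the sequence.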

Slide 3 of 6

The Pronoun Problem

Consider this sentence:

"The trophy didn't fit in the suitcase because it was too big."

What does "it" refer to? The trophy (because the trophy was too big to fit).

"The trophy didn't fit in the suitcase because it was too small."

Now "it" refers to the suitcase!

To resolve "it", the model needs to understand both "trophy" and "suitcase" at the same time — not just remember "trophy" from a fading memory. RNNs struggle with this.

Slide 4 of 6

Another Problem: No Parallelism

RNNs must process words one at a time in sequence. You can't process word 5 until you've finished words 1–4.

❌ RNN: Sequential

Step 1: Process "The"
Step 2: Process "cat"
Step 3: Process "sat"
Step 4: Process "on"

Can't use GPU parallelism!

✅ Transformer: Parallel

Process "The"
Process "cat"
Process "sat"
Process "on"

All at once! 🚀

Full GPU utilization
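The difference shows up directly in code. A sketch in NumPy, with illustrative sizes: the RNN-style loop has a data dependency between steps, while the Transformer-style layer applies the same projection to every token in one matrix multiply, which is exactly what GPUs accelerate:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 16                  # illustrative toy sizes
X = rng.normal(size=(seq_len, d_model))   # 4 token embeddings
W = rng.normal(size=(d_model, d_model))   # a shared projection

# RNN-style: step t cannot start until step t-1 has finished.
outputs = []
h = np.zeros(d_model)
for t in range(seq_len):
    h = np.tanh(X[t] @ W + h)   # depends on the previous hidden state
    outputs.append(h)
sequential = np.stack(outputs)

# Transformer-style: every token is transformed at once — one big
# matrix multiply with no step-to-step dependency.
parallel = np.tanh(X @ W)

print(sequential.shape, parallel.shape)  # → (4, 16) (4, 16)
```

Both produce one vector per token, but only the second formulation lets the hardware work on all four tokens simultaneously.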

Slide 5 of 6

The RNN vs Transformer Comparison

Feature         | RNN/LSTM      | Transformer
Processing      | Sequential ❌ | Parallel ✅
Long-range deps | Struggles ❌  | Handles well ✅
Training speed  | Slow ❌       | Fast ✅
Context window  | Limited ❌    | Configurable ✅
GPU usage       | Poor ❌       | Excellent ✅
Slide 6 of 6

The Solution: Attention

The revolutionary insight: instead of compressing everything into one memory vector, let every word directly attend to every other word.

"The trophy didn't fit because it was too big."

The
trophy
didn't
fit
because
it
was
too
big

⚡ "it" directly attends to "trophy" — no memory required!

✅ With attention, information doesn't have to travel through time steps — any word can directly communicate with any other word!
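The "every word attends to every other word" idea can be sketched in a few lines of NumPy. This is a minimal self-attention sketch with random toy embeddings, not a trained model, so the weights it prints are arbitrary; what matters is the shape of the computation:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: every position looks at every other."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # all-pairs similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d = 9, 4                      # 9 words: "The trophy ... big"
X = rng.normal(size=(seq_len, d))      # random stand-in embeddings
out, weights = attention(X, X, X)      # self-attention: Q = K = V = X

# Row 5 ("it") holds one weight per word — including "trophy" at index 1.
# The connection is direct: no hidden state carries it across time steps.
print(weights[5].round(2))
```

Each row of `weights` sums to 1 and spans the whole sentence, so "it" reaches "trophy" in a single step instead of through seven fading memory updates.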
📞

The Phone Tree vs Group Chat Analogy

RNNs are like a phone tree: Person A calls B, who calls C, who calls D. By the time D hears the message, it's been passed through multiple people and details are lost. Transformers are like a group chat — everyone hears everyone else directly, in parallel.

Key Concepts

🔄
RNN
Recurrent Neural Network — processes sequences one step at a time, left to right.
📉
Vanishing Gradient
Error signals weaken as they travel back through many time steps during training.
⚡
Parallelism
Transformers process all tokens simultaneously, fully utilizing modern GPUs.
🎯
Long-range Dependencies
The ability to connect information across long distances in text — attention handles this directly, since any token can reach any other in one step.

Quick Check

What is one major reason Transformers train faster than RNNs?

A
Transformers process all tokens in parallel, making full use of GPUs
B
Transformers have fewer parameters
C
Transformers skip punctuation tokens
D
Transformers use smaller vocabularies