Bring everything together! See the complete Transformer architecture, explore modern LLM variants, and understand how GPT, BERT, and Claude are built.
⏱️ 12 min · 🎯 Lesson 11 of 11 · 🏆 Final Lesson
Slide 1 of 7
The Complete Transformer
Let's see the full architecture we've been building piece by piece:
Input Tokens
→
Embedding + PE
→
Encoder ×N layers
→
Context Vectors
Output Tokens
→
Embedding + PE
→
Decoder ×N layers
→
Softmax → Next Token
⚡ The encoder and decoder are connected through cross-attention: the middle sub-layer of each decoder layer attends to the encoder's output.
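The whole pipeline above can be sketched with PyTorch's built-in `nn.Transformer` module. This is a shape check on a tiny untrained model, not a real system; all sizes here are illustrative, and the random tensors stand in for the "Embedding + PE" step:

```python
import torch
import torch.nn as nn

# Toy dimensions -- illustrative only
d_model, nhead, n_layers = 64, 4, 2
src_len, tgt_len, batch = 10, 7, 1

model = nn.Transformer(d_model=d_model, nhead=nhead,
                       num_encoder_layers=n_layers,
                       num_decoder_layers=n_layers,
                       dim_feedforward=128)

# In a real model these come from embedding + positional encoding
src = torch.randn(src_len, batch, d_model)   # encoder input
tgt = torch.randn(tgt_len, batch, d_model)   # decoder input (shifted right)

# Causal mask so the decoder can only look left
tgt_mask = model.generate_square_subsequent_mask(tgt_len)

out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # one d_model-sized vector per decoder position
```

A final linear layer plus softmax (not shown) would turn each of those vectors into next-token probabilities.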
Slide 2 of 7
Full Architecture Diagram
The golden arrow shows the cross-attention connection between encoder and decoder.
Slide 3 of 7
Modern Variants: Encoder-Only
BERT (and its family: RoBERTa, ALBERT) uses only the encoder. It's great for understanding text, not generating it.
📥
Input: Text (both sides of a [MASK])
⚡
Process: Bidirectional encoder — looks at all context
📤
Output: Contextual embeddings / fill-in-the-blank
BERT · RoBERTa · DeBERTa · ELECTRA
Best for
✅ Text classification
✅ Named entity recognition
✅ Question answering
✅ Semantic search
❌ Not for generation
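The "looks at all context" point is visible directly in the attention weights: with no mask, every position puts nonzero weight on every other position, including tokens to its right. A toy NumPy sketch (random vectors, not a trained BERT):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 5, 8
Q = rng.normal(size=(seq_len, d))  # queries, one per token
K = rng.normal(size=(seq_len, d))  # keys, one per token

# Bidirectional (BERT-style) attention: no mask applied
weights = softmax(Q @ K.T / np.sqrt(d))

print(weights[0])           # position 0 attends to ALL positions...
print((weights > 0).all())  # ...including future ones: True
```

This is exactly why a [MASK] token can be predicted from both its left and right context.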
Slide 4 of 7
Modern Variants: Decoder-Only
GPT (and Claude, Llama, Gemini) uses only the decoder with causal masking. This is the dominant architecture for modern LLMs.
📥
Input: Prompt / context
⚡
Process: Causal decoder — can only look left
📤
Output: Next token probabilities → generates text
GPT-4 · Claude · Llama · Gemini · Mistral
Best for
✅ Text generation
✅ Chatbots & assistants
✅ Code generation
✅ Creative writing
✅ Reasoning tasks
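The "can only look left" constraint is just a triangular mask added to the attention scores before the softmax. A toy NumPy sketch (same illustrative setup as the attention math in earlier lessons):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 5, 8
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))

scores = Q @ K.T / np.sqrt(d)

# Causal mask: -inf above the diagonal blocks attention to future tokens
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
weights = softmax(scores + mask)

print(np.round(weights, 2))
# Upper triangle is exactly 0: token i never sees tokens i+1, i+2, ...
```

Compare with the encoder-only case: the only difference between "BERT-style" and "GPT-style" attention in this sketch is that one line adding the mask.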
Slide 5 of 7
How LLMs are Trained
Training a large language model happens in stages:
1
Pre-training (Next Token Prediction) Feed the model trillions of tokens from the internet. Train it to predict the next token. This is self-supervised — no human labels needed!
2
Supervised Fine-Tuning (SFT) Fine-tune on high-quality examples of instructions + good responses. The model learns to follow instructions.
3
RLHF (Reinforcement Learning from Human Feedback) Human raters rank model responses, a reward model is trained on those rankings, and the LLM is then fine-tuned to maximize the reward model's score. This makes models helpful, harmless, and honest.
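Stage 1's objective is simpler than it sounds: it is just cross-entropy on the shifted sequence, scoring the model's prediction at position t against the token that actually appears at t+1. A toy NumPy sketch with random logits standing in for a model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

vocab, seq_len = 10, 4
tokens = np.array([3, 1, 4, 1, 5])         # a 5-token sequence as IDs

rng = np.random.default_rng(0)
logits = rng.normal(size=(seq_len, vocab))  # fake model output at positions 0..3
probs = softmax(logits)

targets = tokens[1:]                        # predict token t+1 from tokens <= t

# Average negative log-likelihood of the true next tokens
loss = -np.log(probs[np.arange(seq_len), targets]).mean()
print(f"next-token loss: {loss:.3f}")       # a random model lands near ln(10) ~ 2.3
```

No labels appear anywhere here: the "supervision" is the text itself, which is what makes pre-training self-supervised.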
Slide 6 of 7
The Scale of Modern LLMs
Modern LLMs are mind-bogglingly large:
Model
Parameters
Context
Layers
GPT-2
1.5B
1,024
48
GPT-3
175B
2,048
96
Llama 3 70B
70B
128K
80
Claude 3
undisclosed
200K
undisclosed
GPT-4
~1.7T (est.)
128K
~120
💡 GPT-3's 175B parameters take up about 350 GB of disk space at 2 bytes per parameter. Training it cost an estimated ~$4.6 million in compute!
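The 350 GB figure is simple arithmetic: 175 billion parameters times 2 bytes each (fp16/bf16 storage). A quick check:

```python
params = 175e9            # GPT-3 parameter count
bytes_per_param = 2       # fp16 / bf16 storage

size_gb = params * bytes_per_param / 1e9
print(f"{size_gb:.0f} GB")           # 350 GB

# At fp32 (4 bytes per parameter) it would be double:
print(f"{params * 4 / 1e9:.0f} GB")  # 700 GB
```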
Slide 7 of 7
🎉 Congratulations! You Made It!
You now understand the complete Transformer architecture! Let's recap everything you've learned:
✓
Tokenization — Text → Token IDs
✓
Embeddings — Token IDs → Meaning vectors
✓
Positional Encoding — Adding order awareness
✓
Self-Attention — Every token attends to every other
✓
Multi-Head Attention — Parallel attention heads
✓
Encoder — Understanding input deeply
✓
Decoder — Auto-regressive generation
🚀 You now understand the technology powering ChatGPT, Claude, Gemini, and every other modern AI language model. Incredible!
🏛️
The Complete Picture
A Transformer is like a sophisticated translation chamber: input enters the encoder, which builds a deep understanding. The decoder then uses that understanding, alongside what it has already written, to produce the output one token at a time, guided by 6 (or 96) layers of multi-head attention, residual connections, and feed-forward networks.
🏗️ Full Architecture Overview
The complete Encoder-Decoder Transformer. Encoder on the left, Decoder on the right, connected by cross-attention.
The Family of Transformers
🔍
Encoder-Only (BERT)
Bidirectional understanding. Best for classification, NER, search. Not for generation.
✍️
Decoder-Only (GPT)
Auto-regressive generation. Powers ChatGPT, Claude, Llama. The dominant architecture today.
🔄
Encoder-Decoder (T5)
Original architecture. Best for seq2seq tasks: translation, summarization. Used by T5, BART.
🚀
Scaling Laws
More parameters + more data = better performance. This discovery drove the LLM revolution.
Final Challenge
GPT-4, Claude, and Llama are all examples of which Transformer variant?
A
Encoder-only (like BERT)
B
Full Encoder-Decoder (like the original Transformer)
C
Decoder-only with causal (masked) self-attention
D
RNN with attention
🎓
Course Complete!
You've completed all 11 lessons and now have a solid understanding of how Transformers work: the technology behind every major LLM.