Module 1 · Foundations

Tokenization: Text → Numbers

Learn how AI models break text into tokens — the first step in processing language.

⏱️ 8 min 🎯 Lesson 2 of 11
Slide 1 of 6

Computers Speak Numbers

Computers are fundamentally number-crunching machines. They can't directly process text like "Hello world" — they need numbers.

❌ Computer cannot process:
"Hello, how are you?"

✅ Computer CAN process:
[15496, 11, 703, 389, 345, 30]

The first step: Text String → Token IDs

Tokenization is the process of converting text into a sequence of numbers that a model can process.
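The text → numbers step can be sketched in a few lines. This is a minimal illustration with a tiny hypothetical vocabulary (real vocabularies have tens of thousands of entries, and real tokenizers also handle splitting the text into pieces):

```python
# Toy lookup table: every known text piece gets a unique integer ID.
# These IDs are made up for illustration, not from a real tokenizer.
VOCAB = {"Hello": 0, ",": 1, "how": 2, "are": 3, "you": 4, "?": 5}

def encode(pieces):
    """Map each text piece to its integer ID."""
    return [VOCAB[p] for p in pieces]

ids = encode(["Hello", ",", "how", "are", "you", "?"])
print(ids)  # [0, 1, 2, 3, 4, 5]
```

The model never sees the strings at all, only the list of integers.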

Slide 2 of 6

What is a Token?

A token is a chunk of text — usually a word, part of a word, or punctuation mark. Each token maps to a unique integer ID.

Example sentence:

"Transformers are amazing!"

Transform | ers | are | amaz | ing | !
9602      | 364 | 389 | 6581 | 278 | 0

Notice that "Transformers" is split into two tokens: "Transform" + "ers". This is called subword tokenization.
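One simple way to see how a word falls apart into known sub-pieces is greedy longest-match splitting against a fixed vocabulary. Real BPE tokenizers use learned merge rules instead, but this sketch (with a hypothetical subword set) produces the same kind of split:

```python
# Hypothetical subword vocabulary for illustration only.
SUBWORDS = {"Transform", "ers", "are", "amaz", "ing", "!"}

def split_subwords(word):
    """Greedily take the longest known prefix at each position."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in SUBWORDS:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown: fall back to one character
            i += 1
    return pieces

print(split_subwords("Transformers"))  # ['Transform', 'ers']
```

Because every single character can be a fallback token, any input, even a word the tokenizer has never seen, can always be encoded.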

Slide 3 of 6

Why Subword Tokenization?

Why not just use whole words? Three big problems:

1. New words — "COVID-19", "blockchain", slang. A whole-word vocab can't handle them.
2. Huge vocabulary — English has 170,000+ words. That's an enormous lookup table.
3. Morphology — "run", "running", "runner" share a root. Subwords capture this structure.
💡 BPE (Byte-Pair Encoding) is the most popular algorithm. It starts from individual characters and repeatedly merges the most frequent adjacent pair until it reaches a target vocabulary size (typically ~50,000 tokens).
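The core BPE training loop fits in a short sketch. This toy version learns merges from a four-word corpus; real BPE trains on huge text corpora and runs tens of thousands of merges, but the loop is the same idea:

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merge rules: repeatedly fuse the most frequent pair."""
    # Each word starts as a tuple of single-character symbols.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the winning pair fused into one symbol.
        merged = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges

print(bpe_merges(["low", "lower", "lowest", "low"], 2))
# [('l', 'o'), ('lo', 'w')]
```

After two merges the corpus already treats the shared stem "low" as a single symbol, which is exactly how subword vocabularies end up capturing morphology.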
Slide 4 of 6

Real Examples from GPT-4

Here's how different texts tokenize in real models:

"Hello world" → 2 tokens

Hello | world

"unbelievable" → 3 tokens

un | believ | able

"ChatGPT" → 3 tokens

Chat | G | PT

💡 1 token ≈ 4 characters in English. GPT-4 can process up to 128,000 tokens at once.
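The ~4-characters-per-token rule of thumb gives a quick back-of-the-envelope estimate without running a real tokenizer. A minimal sketch (an approximation only — actual counts vary by language and content):

```python
def estimate_tokens(text):
    """Rough English token count: ~4 characters per token."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("Hello world"))  # 11 characters -> ~3 tokens
```

Useful for sanity-checking whether a document will fit in a context window before paying for a real encoding pass.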

Slide 5 of 6

The Vocabulary

Every tokenizer has a vocabulary — a fixed list of all possible tokens, each with a unique ID.

Model   | Vocab Size
GPT-2   | 50,257
GPT-4   | 100,277
BERT    | 30,522
Llama 3 | 128,000

Token lookup:
"hello" → 31373
"world" → 995
"!"     → 0
"AI"    → 20185
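Because the vocabulary is a fixed one-to-one mapping, it works in both directions: encoding looks up IDs, and decoding inverts the table. A sketch using the example IDs from the lookup above (treated here as illustrative values):

```python
# Forward table: token string -> integer ID (example values from above).
vocab = {"hello": 31373, "world": 995, "!": 0, "AI": 20185}
# Inverse table: integer ID -> token string.
id_to_token = {i: t for t, i in vocab.items()}

def encode(tokens):
    return [vocab[t] for t in tokens]

def decode(ids):
    return [id_to_token[i] for i in ids]

ids = encode(["hello", "world", "!"])
print(ids)                            # [31373, 995, 0]
print(decode(ids))                    # ['hello', 'world', '!']
```

The round trip is lossless: every ID maps back to exactly one token, which is why models can output IDs and still produce readable text.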
Slide 6 of 6

Try It: Interactive Demo

See tokenization in action! Type any text below and watch it split into tokens:

📖

The Dictionary Analogy

Think of the tokenizer as a special dictionary. Every possible word-piece has a page number. The model gets the page numbers, not the words — and it's surprisingly good at understanding meaning from page numbers alone.

Key Concepts

🔤
Token
A chunk of text (word, subword, or character) mapped to a unique integer ID.
📚
Vocabulary
The complete list of all tokens the model knows (~50K–128K tokens).
🧩
BPE
Byte-Pair Encoding — merges frequent character pairs to build an efficient vocabulary.
📏
Context Window
The maximum number of tokens a model can process at once (e.g., 128K for GPT-4).

Quick Check

Why do modern tokenizers split "unbelievable" into sub-parts like "un" + "believ" + "able"?

A. To confuse the model
B. Because computers can't store long words
C. To handle rare/unknown words and share vocabulary across similar words
D. Because "unbelievable" has no meaning as a whole