Module 1 · Foundations

Tokenization: Text → Numbers

Learn how AI models break text into tokens — the first step in processing language.

⏱️ 8 min 🎯 Lesson 2 of 11
Slide 1 of 6

Computers Speak Numbers

Computers are fundamentally number-crunching machines. They can't directly process text like "Hello world" — they need numbers.

❌ Computer cannot process:
"Hello, how are you?"

✅ Computer CAN process:
[15496, 11, 703, 389, 345, 30]

The first step: Text String → Token IDs

Tokenization is the process of converting text into a sequence of numbers that a model can process.
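The text → numbers step can be sketched in a few lines. This is a minimal illustration with a tiny hypothetical vocabulary (real vocabularies have tens of thousands of entries, and real tokenizers also handle splitting the text into pieces):

```python
# Toy lookup table: every known text piece gets a unique integer ID.
# These IDs are made up for illustration, not from a real tokenizer.
VOCAB = {"Hello": 0, ",": 1, "how": 2, "are": 3, "you": 4, "?": 5}

def encode(pieces):
    """Map each text piece to its integer ID."""
    return [VOCAB[p] for p in pieces]

ids = encode(["Hello", ",", "how", "are", "you", "?"])
print(ids)  # [0, 1, 2, 3, 4, 5]
```

The model never sees the strings at all, only the list of integers.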

Slide 2 of 6

What is a Token?

A token is a chunk of text — usually a word, part of a word, or punctuation mark. Each token maps to a unique integer ID.

Example sentence:

"Transformers are amazing!"

Transform | ers | are | amaz | ing | !
9602      | 364 | 389 | 6581 | 278 | 0

Notice that "Transformers" is split into two tokens: "Transform" + "ers". This is called subword tokenization.
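One simple way to see how a word falls apart into known sub-pieces is greedy longest-match splitting against a fixed vocabulary. Real BPE tokenizers use learned merge rules instead, but this sketch (with a hypothetical subword set) produces the same kind of split:

```python
# Hypothetical subword vocabulary for illustration only.
SUBWORDS = {"Transform", "ers", "are", "amaz", "ing", "!"}

def split_subwords(word):
    """Greedily take the longest known prefix at each position."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in SUBWORDS:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown: fall back to one character
            i += 1
    return pieces

print(split_subwords("Transformers"))  # ['Transform', 'ers']
```

Because every single character can be a fallback token, any input, even a word the tokenizer has never seen, can always be encoded.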

Slide 3 of 6

Why Subword Tokenization?

Why not just use whole words? Three big problems:

1. New words — "COVID-19", "blockchain", slang. A whole-word vocab can't handle them.
2. Huge vocabulary — English has 170,000+ words. That's an enormous lookup table.
3. Morphology — "run", "running", "runner" share a root. Subwords capture this structure.
💡 BPE (Byte-Pair Encoding) is the most popular algorithm. It starts from individual characters and repeatedly merges the most frequent adjacent pair until it reaches a target vocabulary size (typically ~50,000 tokens).
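The core BPE training loop fits in a short sketch. This toy version learns merges from a four-word corpus; real BPE trains on huge text corpora and runs tens of thousands of merges, but the loop is the same idea:

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merge rules: repeatedly fuse the most frequent pair."""
    # Each word starts as a tuple of single-character symbols.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the winning pair fused into one symbol.
        merged = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges

print(bpe_merges(["low", "lower", "lowest", "low"], 2))
# [('l', 'o'), ('lo', 'w')]
```

After two merges the corpus already treats the shared stem "low" as a single symbol, which is exactly how subword vocabularies end up capturing morphology.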
Slide 4 of 6

Real Examples from GPT-4

Here's how different texts tokenize in real models:

"Hello world" → 2 tokens

Hello | world

"unbelievable" → 3 tokens

un | believ | able

"ChatGPT" → 3 tokens

Chat | G | PT

💡 1 token ≈ 4 characters in English. GPT-4 can process up to 128,000 tokens at once.
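The ~4-characters-per-token rule of thumb gives a quick back-of-the-envelope estimate without running a real tokenizer. A minimal sketch (an approximation only — actual counts vary by language and content):

```python
def estimate_tokens(text):
    """Rough English token count: ~4 characters per token."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("Hello world"))  # 11 characters -> ~3 tokens
```

Useful for sanity-checking whether a document will fit in a context window before paying for a real encoding pass.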

Slide 5 of 6

The Vocabulary

Every tokenizer has a vocabulary — a fixed list of all possible tokens, each with a unique ID.

Model   | Vocab Size
GPT-2   | 50,257
GPT-4   | 100,277
BERT    | 30,522
Llama 3 | 128,000

Token lookup:
"hello" → 31373
"world" → 995
"!"     → 0
"AI"    → 20185
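Because the vocabulary is a fixed one-to-one mapping, it works in both directions: encoding looks up IDs, and decoding inverts the table. A sketch using the example IDs from the lookup above (treated here as illustrative values):

```python
# Forward table: token string -> integer ID (example values from above).
vocab = {"hello": 31373, "world": 995, "!": 0, "AI": 20185}
# Inverse table: integer ID -> token string.
id_to_token = {i: t for t, i in vocab.items()}

def encode(tokens):
    return [vocab[t] for t in tokens]

def decode(ids):
    return [id_to_token[i] for i in ids]

ids = encode(["hello", "world", "!"])
print(ids)                            # [31373, 995, 0]
print(decode(ids))                    # ['hello', 'world', '!']
```

The round trip is lossless: every ID maps back to exactly one token, which is why models can output IDs and still produce readable text.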
Slide 6 of 6

Try It: Interactive Demo

See tokenization in action! Type any text below and watch it split into tokens:

📖

The Dictionary Analogy

Think of the tokenizer as a special dictionary. Every possible word-piece has a page number. The model gets the page numbers, not the words — and it's surprisingly good at understanding meaning from page numbers alone.

Key Concepts

🔤
Token
A chunk of text (word, subword, or character) mapped to a unique integer ID.
📚
Vocabulary
The complete list of all tokens the model knows (~50K–128K tokens).
🧩
BPE
Byte-Pair Encoding — merges frequent character pairs to build an efficient vocabulary.
📏
Context Window
The maximum number of tokens a model can process at once (e.g., 128K for GPT-4).

Quick Check

Why do modern tokenizers split "unbelievable" into sub-parts like "un" + "believ" + "able"?

A. To confuse the model
B. Because computers can't store long words
C. To handle rare/unknown words and share vocabulary across similar words
D. Because "unbelievable" has no meaning as a whole