Module 1 · Foundations
Tokenization: Text → Numbers
Learn how AI models break text into tokens — the first step in processing language.
The Dictionary Analogy
Think of the tokenizer as a special dictionary. Every possible word-piece has a page number. The model gets the page numbers, not the words — and it's surprisingly good at understanding meaning from page numbers alone.
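The analogy can be made concrete with a toy sketch. This is not a real tokenizer — the vocabulary here is invented and tiny — but it shows the core idea: each known word-piece gets a fixed integer "page number," and the model only ever sees those numbers.

```python
# A toy "dictionary" tokenizer (illustrative only, not a real vocabulary).
vocab = {"un": 0, "believ": 1, "able": 2, "the": 3, "cat": 4}
inverse = {i: piece for piece, i in vocab.items()}

def encode(pieces):
    """Look up each word-piece's 'page number' in the vocabulary."""
    return [vocab[p] for p in pieces]

def decode(ids):
    """Turn page numbers back into text."""
    return "".join(inverse[i] for i in ids)

ids = encode(["un", "believ", "able"])
print(ids)          # [0, 1, 2]
print(decode(ids))  # unbelievable
```

Real tokenizers also handle the splitting step automatically — deciding where "unbelievable" breaks into pieces — which is what BPE (below) is for.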
Key Concepts
Token
A chunk of text (word, subword, or character) mapped to a unique integer ID.
Vocabulary
The complete list of all tokens a model knows; modern vocabularies typically hold ~50K–128K entries.
BPE
Byte-Pair Encoding — repeatedly merges the most frequent adjacent symbol pair to build an efficient vocabulary.
Context Window
The maximum number of tokens a model can process at once (e.g., 128K for GPT-4 Turbo).
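To make the BPE idea concrete, here is a minimal training-loop sketch on an invented toy corpus (the word frequencies are made up for illustration). Each round counts adjacent symbol pairs across the corpus and merges the most frequent one into a new token:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words as character tuples with frequencies (illustrative only).
words = {("l","o","w"): 5, ("l","o","w","e","r"): 2, ("l","o","w","e","s","t"): 3}
for _ in range(3):  # run a few merge rounds
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged:", pair)
```

After a few rounds, frequent fragments like "low" become single tokens, while rarer endings ("er", "est") stay as smaller pieces — which is exactly why common words end up as one token and rare words as several.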
Quick Check
Why do modern tokenizers split "unbelievable" into sub-parts like "un" + "believ" + "able"?