How Tokenization Works
Language models don't read words — they read tokens. Text is split into subword pieces using BPE (Byte-Pair Encoding), each mapped to an integer ID and then to a dense vector (embedding). Try the presets or type your own text below.
💡 Analogy: Tokenization is like breaking a sentence into Scrabble tiles — except the "tile set" has ~50,000 subword entries. Each tile gets a unique ID number, which the model later uses to look up a 768-number vector capturing its meaning.
Each token ID looks up a row in a learned embedding matrix (e.g. 50,257 × 768 for GPT-2). The result is a 768-dimensional vector that captures the token's meaning.
Step 0: Input Tokens
💡 Think of it as: Assigning each word fragment a unique barcode number from a fixed catalog of ~50,000 entries.
The journey begins with raw text. The tokenizer splits the input into subword tokens — the atomic units the model processes. Each token is assigned an integer ID from the vocabulary (typically ~50,000 entries).
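The split-then-lookup step can be sketched with a toy greedy tokenizer. This is not real BPE (which learns merges from data), and the tiny vocabulary here is entirely hypothetical — it only illustrates "text in, integer IDs out":

```python
# Toy illustration of the token -> ID mapping. NOT real BPE: the
# vocabulary below is a made-up, six-entry stand-in for the ~50,000
# learned subword entries a real tokenizer would have.
TOY_VOCAB = {"The": 0, " cat": 1, " sat": 2, " on": 3, " the": 4, " mat": 5}

def encode(text: str, vocab: dict) -> list:
    """Greedy longest-match split of text into known subword IDs."""
    ids, i = [], 0
    while i < len(text):
        # Try the longest remaining prefix that is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i:]!r}")
    return ids

print(encode("The cat sat on the mat", TOY_VOCAB))  # → [0, 1, 2, 3, 4, 5]
```

Note how " cat" includes its leading space — real BPE tokenizers also fold whitespace into tokens this way.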
The Big Picture: Encoder–Decoder Architecture
The original Transformer (2017) uses an Encoder–Decoder design. Most modern LLMs (GPT, Claude, Llama) use only the Decoder half. During training, the model sees both input and expected output. During inference, the decoder generates tokens one at a time, feeding each output back as input.
Token Examples
Step 1: Embedding + Positional Encoding
💡 Analogy: The embedding table is like a massive dictionary where each word maps to a list of 768 numbers capturing its meaning. Positional encoding is like adding page numbers to a shuffled book — it restores order.
Each token ID is used to look up a row in the embedding matrix — a giant table of learned vectors. For GPT-2, this matrix has 50,257 rows × 768 columns. Then a positional encoding is added so the model knows word order (since attention has no inherent notion of position).
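The lookup itself is trivial: each ID indexes one row of the table. A minimal sketch, using random numbers as a stand-in for trained weights (and only 8 rows instead of 50,257, to keep it small):

```python
import random

random.seed(0)
D_MODEL = 768  # GPT-2's embedding width, from the text above

# A learned embedding matrix is just a big lookup table. Here we fill a
# tiny 8-row slice with random values standing in for trained weights.
embedding = [[random.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(8)]

def embed(token_ids):
    # Each ID selects one row of the table: a pure lookup, no arithmetic.
    return [embedding[t] for t in token_ids]

vectors = embed([3, 1, 4])
print(len(vectors), len(vectors[0]))  # 3 tokens, each a 768-dim vector
```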
Embedding Lookup Table
Positional Encoding (Sine Waves)
Different dimensions use sine/cosine waves at different frequencies. Each position gets a unique "fingerprint" that tells the model where it is in the sequence.
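The sinusoidal scheme from the original Transformer paper can be written in a few lines — even dimensions get a sine, odd dimensions a cosine, and the frequency falls off geometrically across dimensions:

```python
import math

def positional_encoding(pos: int, d_model: int):
    """Sinusoidal positional encoding ('Attention Is All You Need'):
    even dims use sin, odd dims use cos, at geometrically decaying
    frequencies, so every position gets a unique fingerprint."""
    pe = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))
        pe.append(math.sin(pos * freq))
        pe.append(math.cos(pos * freq))
    return pe[:d_model]

print(positional_encoding(0, 4))  # → [0.0, 1.0, 0.0, 1.0]
```

Position 0 always yields alternating 0s and 1s (sin 0 = 0, cos 0 = 1); every later position produces a different pattern, which is what lets the model distinguish positions.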
Word Embedding Space (2D Projection)
In reality, embeddings live in 768+ dimensions. Here we project to 2D to visualize how semantically similar words cluster together. Notice: dog, puppy, and cat point in similar directions — they share meaning. skateboard and car point elsewhere.
Why Positional Encoding Matters
Attention has no notion of position by default — it can attend equally to any token in any order. Positional encoding tells the model where each token is in the sequence. Slide the bar below to remove position information and watch attention patterns degrade:
Attention Pattern (Degrading)
💡 With position info removed, the model loses track of word order. Attention becomes confused.
Step 2: Self-Attention
💡 Analogy: Imagine reading a sentence with 8 different-colored highlighters at once. Each highlighter focuses on a different type of relationship: who does what, what modifies what, what refers to what. Multi-head attention lets the model "highlight" the sentence in 8 ways simultaneously.
The core innovation of Transformers. Each token creates three vectors — Query (Q), Key (K), and Value (V) — by multiplying its embedding with learned weight matrices. The attention score between two tokens is computed as softmax(Q·Kᵀ / √dk), where dk is the dimension of the key vectors; dividing by √dk keeps the dot products from growing too large, which stabilizes gradients during training.
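The formula maps directly to code. A minimal sketch of scaled dot-product attention on toy 2-dimensional vectors (the numbers are illustrative, not learned weights):

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q·Kᵀ/√dk)·V.
    Q, K, V are lists of vectors, one per token."""
    dk = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dk)
                  for k in K]
        weights = softmax(scores)    # attention distribution, sums to 1
        out.append([sum(w * v[d] for w, v in zip(weights, V))
                    for d in range(len(V[0]))])
    return out

# Two tokens, dk = 2. Token 0's query matches key 0 most strongly,
# so its output leans toward value 0.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```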
Q — Query
"What am I looking for?"
Like a search query you type.
K — Key
"What do I contain?"
Like a page title or tag.
V — Value
"Here's my actual content."
The information to retrieve.
Interactive: Click a word to see its attention
Attention Heatmap
Each cell shows how much attention the row-word pays to the column-word. Click a word above to highlight its row.
Multi-Head Attention (8 heads)
Instead of one attention computation, the model runs 8 parallel heads — each learning to focus on different relationships (syntax, coreference, semantics, etc.). Their outputs are concatenated and projected.
Projection
Each head independently computes attention → outputs are concatenated → projected back to model dimension
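The split → attend → concatenate wiring can be sketched as below. This is a simplification: a real model applies learned per-head projection matrices (W_Q, W_K, W_V) and a final output projection, both omitted here — each head simply attends within its own slice of the vector:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def head_attention(Q, K, V):
    """Scaled dot-product attention within one head's slice."""
    dk = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dk) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[d] for wi, v in zip(w, V)) for d in range(dk)])
    return out

def multi_head(X, n_heads):
    """Split each token vector into n_heads slices, run self-attention
    on each slice independently, then concatenate the head outputs
    (learned projections omitted in this sketch)."""
    d = len(X[0]) // n_heads
    heads = []
    for h in range(n_heads):
        sl = [x[h * d:(h + 1) * d] for x in X]    # this head's slice of every token
        heads.append(head_attention(sl, sl, sl))  # self-attention on the slice
    # Concatenate head outputs back to the full model dimension.
    return [sum((heads[h][t] for h in range(n_heads)), []) for t in range(len(X))]

X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
out = multi_head(X, n_heads=2)
print(len(out), len(out[0]))  # 2 tokens, still 4 dims after concatenation
```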
Deep Dive: Q·K·V Computation
Click a word to see how Query (Q) matches against all Keys (K) to produce attention scores. The softmax converts these scores to percentages, which weight the Values (V) in the output.
Real Sentence Examples
See how attention mechanisms help the model understand different linguistic phenomena:
Why Multi-Head Attention?
Compare what a single attention head can capture vs. the full multi-head power:
Step 3: Add & Normalize
💡 Analogy: Residual connections are like keeping a backup of your original notes. After the attention layer "annotates" your notes, you merge the annotations with the originals — ensuring no information is lost. Layer Norm then tidies everything up so the values don't explode or vanish.
After attention, a residual connection adds the original input back to the output (x + Sublayer(x)). This is followed by Layer Normalization, which stabilizes training by normalizing values across the feature dimension. Residual connections let gradients flow easily through deep networks, mitigating the vanishing-gradient problem.
(from previous layer)
or FFN
(skip connection)
normalize
Layer Normalization — Animated
Watch how wildly varying values get "tamed" into a stable range (mean≈0, std≈1). This prevents values from exploding or vanishing through 96+ layers.
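The whole Add & Norm step fits in a few lines. A minimal sketch — note that a real LayerNorm also has learned gain and bias parameters (γ, β), omitted here:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token's features to mean ≈ 0, std ≈ 1."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    # Residual connection: keep the original signal, add the sublayer's
    # output on top, then normalize the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

# A wildly varying input gets "tamed" into a stable range.
y = add_and_norm([1.0, 2.0, 300.0], [0.5, -0.5, 1.0])
print(round(sum(y) / len(y), 6))  # mean ≈ 0 after normalization
```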
Step 4: FFN — Feed-Forward Network (also called MLP — Multi-Layer Perceptron)
💡 Analogy: If attention is "which words should I look at?", the feed-forward network is "what do I know about what I just saw?" It's the model's memory bank — expanding the data to search a broader set of patterns, then compressing it back to a manageable size.
After attention, each token passes independently through a 2-layer fully-connected network. This is where the model stores and retrieves factual knowledge. The hidden layer expands the dimension (768 → 3072), applies the GELU (Gaussian Error Linear Unit) activation, then compresses back (3072 → 768).
GELU (Gaussian Error Linear Unit) is a smooth activation function that gates values based on their probability of being positive — it's like a "soft switch" that decides which neurons should fire.
Why "MLP"? MLP stands for Multi-Layer Perceptron — the simplest type of neural network with an input layer, hidden layer(s), and output layer. In Transformers, the FFN block is a 2-layer MLP applied independently to each token.
Step 5: Architecture Types
The Transformer paper introduced an Encoder–Decoder design, but modern models use three main variants depending on the task. Additionally, non-transformer architectures have emerged as alternatives.
Transformer Variants
Beyond Transformers
Step 6: Stacked Layers
💡 Analogy: Each layer is like a successive round of editing. The first few layers handle syntax and grammar, middle layers capture relationships and facts, and the final layers refine the prediction. Like a document passing through 12+ editors, each specializing in a different aspect.
A single Transformer block (Attention → Add&Norm → FFN → Add&Norm) is repeated N times. GPT-2 has 12 layers, GPT-3 has 96 layers, Llama 3 405B has 126 layers. Each layer refines the token representations, building increasingly abstract understanding.
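The stacking itself is just a loop over identical blocks. In this structural sketch the sublayers are trivial stand-ins (scalar scalings, norms omitted) — only the wiring, a residual stream passing through N repeated blocks, matches a real model:

```python
def attention_stub(x):   # placeholder for self-attention
    return [v * 0.5 for v in x]

def ffn_stub(x):         # placeholder for the feed-forward network
    return [v * 0.1 for v in x]

def block(x):
    # One block: Attention → Add → FFN → Add (normalization omitted).
    x = [a + b for a, b in zip(x, attention_stub(x))]
    x = [a + b for a, b in zip(x, ffn_stub(x))]
    return x

def model(x, n_layers):
    for _ in range(n_layers):   # GPT-2: 12, GPT-3: 96, Llama 3 405B: 126
        x = block(x)            # each layer refines the representation
    return x

print(model([1.0, -1.0], n_layers=12))
```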
Step 7: Output Prediction
💡 Analogy: Imagine a game-show wheel with wedges sized by probability. "the" gets the biggest wedge (42%), "a" gets a medium one (18%), and rare words get tiny slivers. The temperature knob changes how much the wheel is "weighted" — low temperature makes the big wedge even bigger; high temperature makes all wedges more equal.
The final layer's output is projected to vocabulary size (50,257 for GPT-2) via a linear layer, then softmax converts raw scores (logits) into probabilities. The model samples from this distribution to pick the next token. Temperature controls randomness — lower = more deterministic.
"The cat sat on ___" → Next token probabilities:
The highest-probability token ("the" at 42%) is the most likely next word. With temperature → 0 (greedy decoding), the model always picks the top token. Higher temperature flattens the distribution, introducing more randomness and creativity.
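Temperature-scaled sampling is a small function: divide the logits by the temperature before the softmax, then sample from the resulting distribution. The logits below are hypothetical numbers for the "The cat sat on ___" example, not real model outputs:

```python
import math, random

def sample_next_token(logits, temperature=1.0):
    """Convert logits to probabilities at a given temperature, then sample.
    temperature=0 is greedy decoding (always pick the top logit)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]   # low T sharpens, high T flattens
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs)[0]

# Hypothetical logits: "the" has the largest score.
vocab = ["the", "a", "my", "mat"]
logits = [2.0, 1.2, 0.4, 0.1]
print(vocab[sample_next_token(logits, temperature=0)])  # → "the" (greedy)
```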
How Models Generate Text (Autoregressive Decoding)
Language models don't produce a full sentence at once. They generate one token at a time, adding each new token to the context and repeating. This is called autoregressive decoding.
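The generate-append-repeat loop can be sketched as follows. The "model" here is a hypothetical lookup table mapping the last token to a next token — a real model would run the full Transformer stack over the entire context at every step:

```python
# Hypothetical next-token table standing in for a real model's prediction.
NEXT = {"The": "cat", "cat": "sat", "sat": "on", "on": "the", "the": "mat"}

def generate(prompt_tokens, n_steps):
    tokens = list(prompt_tokens)
    for _ in range(n_steps):
        nxt = NEXT[tokens[-1]]   # "predict" the next token from the context
        tokens.append(nxt)       # feed the output back in as input
    return tokens

print(generate(["The"], 5))  # → ['The', 'cat', 'sat', 'on', 'the', 'mat']
```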
Why Transformers Beat RNNs
Transformers revolutionized NLP by processing sequences in parallel instead of sequentially:
Model Landscape
A comparison of major language models (LLMs — Large Language Models) — their architectures, context windows, and accessibility. Not all modern models use Transformers: SSMs (State Space Models, e.g. Mamba), modern RNNs (Recurrent Neural Networks, e.g. RWKV), and hybrids have emerged.
| Model | Company | Architecture | Type | Context | Open Weight |
|---|---|---|---|---|---|