How Tokenization Works

Language models don't read words — they read tokens. Text is split into subword pieces using BPE (Byte-Pair Encoding), each mapped to an integer ID and then to a dense vector (embedding). Try the presets or type your own text below.

💡 Analogy: Tokenization is like breaking a sentence into Scrabble tiles — except the "tile set" has ~50,000 subword entries. Each tile gets a unique ID number, which the model later uses to look up a 768-number vector capturing its meaning.


Each token ID looks up a row in a learned embedding matrix (e.g. 50,257 × 768 for GPT-2). The result is a 768-dimensional vector that captures the token's meaning.

Step 0: Input Tokens

💡 Think of it as: Assigning each word fragment a unique barcode number from a fixed catalog of ~50,000 entries.

The journey begins with raw text. The tokenizer splits the input into subword tokens — the atomic units the model processes. Each token is assigned an integer ID from the vocabulary (typically ~50,000 entries).

The Big Picture: Encoder–Decoder Architecture

The original Transformer (2017) uses an Encoder–Decoder design. Most modern LLMs (GPT, Claude, Llama) use only the Decoder half. During training, the model sees both input and expected output. During inference, the decoder generates tokens one at a time, feeding each output back as input.

[Diagram: Encoder–Decoder data flow. The Encoder tokenizes and embeds the input "I heard a dog bark" (e.g. ID 101 → [6, 8, 10], ID 245 → [20, 1, 7]), adds positional encoding, and applies Multi-Head Attention followed by a Feed-Forward Network (FFN). The Decoder embeds the output generated so far, "when my dog was a …" (e.g. ID 136 → [5, 9, 12]), adds positional encoding, and applies Multi-Head Self-Attention, Multi-Head Cross-Attention over the encoder's output, and an FFN to predict the next token: "puppy".]
Training: Both encoder & decoder see full sequences. Loss computed on decoder output.
Inference: Encoder processes input once. Decoder generates token-by-token autoregressively.

Token Examples

The → 464
cat → 3797
sat → 3332
on → 319
the → 262
mat → 2603
. → 13
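
These IDs can be reproduced with OpenAI's tiktoken library, which ships GPT-2's BPE vocabulary (a minimal sketch; assumes tiktoken is installed):

```python
# A minimal sketch: tokenize with GPT-2's BPE via tiktoken (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("gpt2")              # GPT-2's 50,257-entry vocabulary
ids = enc.encode("The cat sat on the mat.")
print(ids)                                       # [464, 3797, 3332, 319, 262, 2603, 13]
print([enc.decode([i]) for i in ids])            # ['The', ' cat', ' sat', ' on', ' the', ' mat', '.']
print(enc.n_vocab)                               # 50257
```

Note the leading spaces in the decoded pieces: BPE folds the space into the following token, which is why " the" (262) and "The" (464) get different IDs.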

Step 1: Embedding + Positional Encoding

💡 Analogy: The embedding table is like a massive dictionary where each word maps to a list of 768 numbers capturing its meaning. Positional encoding is like adding page numbers to a shuffled book — it restores order.

Each token ID is used to look up a row in the embedding matrix — a giant table of learned vectors. For GPT-2, this matrix has 50,257 rows × 768 columns. Then a positional encoding is added so the model knows word order (since attention has no inherent notion of position).

Embedding Lookup Table

Token   ID     Vector (768-d)
The     464    [0.12, -0.34, 0.87, -0.02, 0.56, …]
cat     3797   [-0.45, 0.91, 0.23, 0.67, -0.11, …]
sat     3332   [0.33, -0.22, 0.54, -0.78, 0.19, …]
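
In code, the lookup is a single indexing operation. A minimal PyTorch sketch at GPT-2's dimensions, with randomly initialized (not trained) weights:

```python
# A minimal sketch of the embedding lookup (GPT-2 sizes; random init, not trained).
import torch
import torch.nn as nn

embed = nn.Embedding(50257, 768)         # the lookup table: 50,257 rows x 768 columns
ids = torch.tensor([464, 3797, 3332])    # "The", " cat", " sat"
vectors = embed(ids)                     # each ID selects one learned row
print(vectors.shape)                     # torch.Size([3, 768])
```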

Positional Encoding (Sine Waves)

Different dimensions use sine/cosine waves at different frequencies. Each position gets a unique "fingerprint" that tells the model where it is in the sequence.
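
A minimal NumPy sketch of this sinusoidal scheme, following the original paper's formula (many newer models learn positions or use rotary embeddings instead):

```python
# A minimal sketch of sinusoidal positional encoding: sin/cos waves whose
# frequencies fall geometrically across the embedding dimensions.
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # positions 0 .. max_len-1
    i = np.arange(d_model // 2)[None, :]           # index of each sin/cos pair
    angles = pos / (10000 ** (2 * i / d_model))    # one frequency per pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=16, d_model=768)
print(pe.shape)                                    # (16, 768): one unique "fingerprint" per position
```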

Word Embedding Space (2D Projection)

In reality, embeddings live in 768+ dimensions. Here we project to 2D to visualize how semantically similar words cluster together. Notice: dog, puppy, and cat point in similar directions — they share meaning. skateboard and car point elsewhere.

Why Positional Encoding Matters

Attention has no notion of position by default — it can attend equally to any token in any order. Positional encoding tells the model where each token is in the sequence. Slide the bar below to remove position information and watch attention patterns degrade:


Attention Pattern (Degrading)

💡 With position info removed, the model loses track of word order. Attention becomes confused.

Step 2: Self-Attention

💡 Analogy: Imagine reading a sentence with 8 different-colored highlighters at once. Each highlighter focuses on a different type of relationship: who does what, what modifies what, what refers to what. Multi-head attention lets the model "highlight" the sentence in 8 ways simultaneously.

The core innovation of Transformers. Each token creates three vectors — Query (Q), Key (K), and Value (V) — by multiplying its embedding with learned weight matrices. Attention is then computed as softmax(Q·Kᵀ / √dₖ)·V, where dₖ is the dimension of the key vectors; dividing by √dₖ keeps the dot products small enough that the softmax retains useful gradients.

Q — Query

🔍

"What am I looking for?"
Like a search query you type.

K — Key

🏷️

"What do I contain?"
Like a page title or tag.

V — Value

📄

"Here's my actual content."
The information to retrieve.
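
Putting Q, K, and V together, here is a minimal NumPy sketch of single-head scaled dot-product attention at toy sizes; the weight matrices are random stand-ins for learned parameters:

```python
# A minimal sketch of single-head scaled dot-product attention.
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # how well each query matches each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V, weights                      # attention-weighted mix of values

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                          # 5 tokens, 8-d embeddings (toy sizes)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))  # random stand-ins for learned weights
out, w = attention(x @ Wq, x @ Wk, x @ Wv)
print(w.round(2))                                    # 5x5 attention matrix; rows sum to 1
```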

Interactive: Click a word to see its attention

Attention Heatmap

Each cell shows how much attention the row-word pays to the column-word. Click a word above to highlight its row.

Multi-Head Attention (8 heads)

Instead of one attention computation, the model runs 8 parallel heads — each learning to focus on different relationships (syntax, coreference, semantics, etc.). Their outputs are concatenated and projected.

[Diagram: the tokens "The cat sat" flow through 8 parallel heads (H1 Subject–Verb, H2 Adj–Noun, H3 Coreference, H4 Positional, H5 Semantic, H6 Syntax, H7 Negation, H8 Distance); the head outputs are concatenated and passed through a linear projection, yielding the updated tokens The′, cat′, sat′.]

Each head independently computes attention → outputs are concatenated → projected back to model dimension
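
A minimal PyTorch sketch of that pipeline, using toy dimensions rather than GPT-2's, with random weights and no masking or dropout:

```python
# A minimal sketch of 8-head attention: split into heads, attend, concat, project.
import torch
import torch.nn.functional as F

d_model, n_heads = 64, 8
d_k = d_model // n_heads                              # 8 dims per head
x = torch.randn(1, 3, d_model)                        # batch=1, 3 tokens ("The cat sat")

Wq, Wk, Wv, Wo = (torch.nn.Linear(d_model, d_model) for _ in range(4))

def split_heads(t):                                   # (B, T, d_model) -> (B, heads, T, d_k)
    B, T, _ = t.shape
    return t.view(B, T, n_heads, d_k).transpose(1, 2)

q, k, v = split_heads(Wq(x)), split_heads(Wk(x)), split_heads(Wv(x))
weights = F.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)
heads = weights @ v                                   # each head attends independently
concat = heads.transpose(1, 2).reshape(1, 3, d_model) # concatenate the 8 head outputs
out = Wo(concat)                                      # linear projection back to d_model
print(out.shape)                                      # torch.Size([1, 3, 64])
```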

Deep Dive: Q·K·V Computation

Click a word to see how Query (Q) matches against all Keys (K) to produce attention scores. The softmax converts these scores to percentages, which weight the Values (V) in the output.

Real Sentence Examples

See how attention mechanisms help the model understand different linguistic phenomena:

Why Multi-Head Attention?

Compare what a single attention head can capture with what all eight heads capture together:

Step 3: Add & Normalize

💡 Analogy: Residual connections are like keeping a backup of your original notes. After the attention layer "annotates" your notes, you merge the annotations with the originals — ensuring no information is lost. Layer Norm then tidies everything up so the values don't explode or vanish.

After attention, a residual connection adds the original input back to the output (x + Sublayer(x)). This is followed by Layer Normalization, which stabilizes training by normalizing values across the feature dimension. Residual connections allow gradients to flow easily through deep networks (solving the vanishing gradient problem).

[Diagram: the input from the previous layer passes through a sublayer (Self-Attention or FFN); a skip connection adds the original input back to the sublayer's output, and Layer Norm normalizes the sum to produce the block's output.]

Layer Normalization — Animated

Watch how wildly varying values get "tamed" into a stable range (mean≈0, std≈1). This prevents values from exploding or vanishing through 96+ layers.

Before Norm → x̂ = (x − μ) / σ → After Norm
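
A minimal NumPy sketch of Add & Norm; the sublayer here is a hypothetical stand-in, and the learned gain and bias of real Layer Norm are omitted:

```python
# A minimal sketch of a residual connection followed by layer normalization.
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)        # mean over the feature dimension
    sigma = x.std(axis=-1, keepdims=True)      # std over the feature dimension
    return (x - mu) / (sigma + eps)            # per token: mean ≈ 0, std ≈ 1

def sublayer(x):                               # hypothetical stand-in for attention or FFN
    return 0.5 * x + 1.0

x = np.array([[12.0, -7.0, 3.0, 95.0]])        # wildly varying values
out = layer_norm(x + sublayer(x))              # residual add, then normalize
print(out.round(2))                            # tamed into a stable range
```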

Step 4: FFN — Feed-Forward Network (also called MLP — Multi-Layer Perceptron)

💡 Analogy: If attention is "which words should I look at?", the feed-forward network is "what do I know about what I just saw?" It's the model's memory bank — expanding the data to search a broader set of patterns, then compressing it back to a manageable size.

After attention, each token passes independently through a 2-layer fully-connected network. This is where the model stores and retrieves factual knowledge. The hidden layer expands the dimension (768 → 3072), applies the GELU (Gaussian Error Linear Unit) activation, then compresses back (3072 → 768).

Input: 768-d → Hidden: 3072-d + GELU → Output: 768-d

GELU (Gaussian Error Linear Unit) is a smooth activation function that gates values based on their probability of being positive — it's like a "soft switch" that decides which neurons should fire.

Why "MLP"? MLP stands for Multi-Layer Perceptron — the simplest type of neural network with an input layer, hidden layer(s), and output layer. In Transformers, the FFN block is a 2-layer MLP applied independently to each token.

Step 5: Architecture Types

The Transformer paper introduced an Encoder–Decoder design, but modern models use three main variants depending on the task. Additionally, non-transformer architectures have emerged as alternatives.

Transformer Variants

🔵 Encoder-Only
Sees all tokens at once (bidirectional). Best for understanding.
Examples: BERT, RoBERTa, DeBERTa

🟢 Decoder-Only
Sees only past tokens (causal/autoregressive). Best for generation.
Examples: GPT-4, Claude, Llama, Mistral

🟣 Encoder–Decoder
Encoder reads input, decoder generates output. Best for translation.
Examples: T5, BART, mBART, Flan-T5

Beyond Transformers

SSM — State Space Model (Mamba)
Processes sequences in linear time. No attention needed.
Examples: Mamba, S4, Hyena

🔄 Modern RNN — Recurrent Neural Network (RWKV)
Recurrent models with parallelizable training. Linear complexity at inference.
Examples: RWKV-6, Griffin

🧬 Hybrid (Transformer + SSM/MoE)
Combines Transformer attention with SSM (State Space Model) or MoE (Mixture of Experts) layers for efficiency.
Examples: Jamba 1.5, StripedHyena

Step 6: Stacked Layers

💡 Analogy: Each layer is like a successive round of editing. The first few layers handle syntax and grammar, middle layers capture relationships and facts, and the final layers refine the prediction. Like a document passing through 12+ editors, each specializing in a different aspect.

A single Transformer block (Attention → Add&Norm → FFN → Add&Norm) is repeated N times. GPT-2 has 12 layers, GPT-3 has 96 layers, Llama 3 405B has 126 layers. Each layer refines the token representations, building increasingly abstract understanding.

Input Embeddings + Positions
Transformer Block 1 — Attention + FFN (Feed-Forward)
Transformer Block 2 — Attention + FFN (Feed-Forward)
Transformer Block 3 — Attention + FFN (Feed-Forward)
⋮ (×N layers)
Transformer Block N — Attention + FFN
Final Layer Norm
Output Logits
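
A minimal PyTorch sketch of the stack, with a simplified block in the Attention → Add & Norm → FFN → Add & Norm order shown above (causal masking and dropout omitted):

```python
# A minimal sketch of stacking N Transformer blocks (toy block, random weights).
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        a, _ = self.attn(x, x, x)                 # self-attention (no causal mask here)
        x = self.ln1(x + a)                       # Attention -> Add & Norm
        return self.ln2(x + self.ffn(x))          # FFN -> Add & Norm

blocks = nn.ModuleList(Block() for _ in range(12))  # GPT-2 small: 12 layers
x = torch.randn(1, 7, 768)
for block in blocks:                              # each layer refines the representations
    x = block(x)
print(x.shape)                                    # torch.Size([1, 7, 768])
```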

Step 7: Output Prediction

💡 Analogy: Imagine a game-show wheel with wedges sized by probability. "the" gets the biggest wedge (42%), "a" gets a medium one (18%), and rare words get tiny slivers. The temperature knob changes how much the wheel is "weighted" — low temperature makes the big wedge even bigger; high temperature makes all wedges more equal.

The final layer's output is projected to vocabulary size (50,257 for GPT-2) via a linear layer, then softmax converts raw scores (logits) into probabilities. The model samples from this distribution to pick the next token. Temperature controls randomness — lower = more deterministic.

"The cat sat on ___" → Next token probabilities:

The highest-probability token ("the" at 42%) is the most likely next word. With temperature → 0 (greedy decoding), the model always picks the top token. Higher temperature flattens the distribution, introducing more randomness and creativity.
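
A minimal NumPy sketch of temperature-scaled sampling; the five candidate tokens and their logits are hypothetical, chosen only to illustrate the sharpening and flattening effect:

```python
# A minimal sketch of temperature-scaled softmax sampling over toy logits.
import numpy as np

rng = np.random.default_rng(0)

def softmax_T(logits, temperature):
    z = logits / max(temperature, 1e-8)          # low T sharpens, high T flattens
    p = np.exp(z - z.max())
    return p / p.sum()                           # logits -> probabilities

vocab = ["the", "a", "my", "that", "floor"]      # hypothetical candidate tokens
logits = np.array([3.0, 2.1, 1.7, 1.4, 0.9])     # hypothetical raw scores

for T in (0.2, 1.0, 2.0):
    probs = softmax_T(logits, T)
    tok = vocab[rng.choice(len(vocab), p=probs)]
    print(f"T={T}:", dict(zip(vocab, probs.round(2))), "-> sampled", repr(tok))
```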

How Models Generate Text (Autoregressive Decoding)

Language models don't produce a full sentence at once. They generate one token at a time, adding each new token to the context and repeating. This is called autoregressive decoding.
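
This loop is easy to sketch. Here toy_model is a hypothetical stand-in that returns random logits; in a real system the Transformer forward pass described above would fill that role:

```python
# A minimal sketch of autoregressive decoding: predict, sample, append, repeat.
import numpy as np

rng = np.random.default_rng(0)
V = 50257                                   # GPT-2 vocabulary size

def toy_model(ids):                         # hypothetical stand-in for a real forward pass
    return rng.normal(size=V)               # random next-token logits (illustration only)

def generate(model, ids, n_new, temperature=1.0):
    for _ in range(n_new):
        z = model(ids) / max(temperature, 1e-8)
        probs = np.exp(z - z.max())
        probs /= probs.sum()                # softmax over the vocabulary
        ids = ids + [int(rng.choice(V, p=probs))]  # append sampled token, repeat
    return ids

print(generate(toy_model, [464, 3797, 3332, 319], n_new=5))  # grows one token per step
```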

Why Transformers Beat RNNs

Transformers revolutionized NLP by processing sequences in parallel instead of sequentially:

Model Landscape

A comparison of major language models (LLMs — Large Language Models) — their architectures, context windows, and accessibility. Not all modern models use Transformers: SSMs (State Space Models, e.g. Mamba), modern RNNs (Recurrent Neural Networks, e.g. RWKV), and hybrids have emerged.

Note: Context lengths and details reflect publicly known specifications as of mid-2025. Models are updated frequently — always check the provider's documentation for the latest.
Model | Company | Architecture Type | Context | Open Weight