
This is a continuation of Part 1, which explained the script as a story. Part 2 makes the script re-implementable by being very explicit about shapes.
I’ll use the script’s default hyperparameters:
- `vocab_size = V`
- `block_size = T = 16`
- `n_embd = C = 16`
- `n_head = H = 4`
- `head_dim = D = C/H = 4`
- `n_layer = L = 1`
When I write shapes I’ll use:
- vectors: `[C]`
- matrices: `[out, in]`
- sequences of vectors (time steps): `[t, C]`
1) Parameter shapes (what lives in state_dict)
Embeddings
- Token embedding table `wte: [V, C]`; a lookup with `token_id` gives a vector `[C]`
- Position embedding table `wpe: [T, C]`; a lookup with `pos_id` gives a vector `[C]`
Attention (per layer)
Each is a linear map from [C] → [C]:
- `attn_wq: [C, C]`
- `attn_wk: [C, C]`
- `attn_wv: [C, C]`
- `attn_wo: [C, C]`
MLP (per layer)
Karpathy uses the classic 4× expansion:
- `mlp_fc1: [4C, C]` (up-project)
- `mlp_fc2: [C, 4C]` (down-project)
LM head
- `lm_head: [V, C]`; maps the hidden state `[C]` to logits `[V]`
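If it helps to see those shapes as code, here is a minimal allocation sketch in plain Python (the real script stores its own `Value` objects so it can backprop; the dictionary key names and the `V = 27` vocab size below are my own assumptions for illustration):

```python
import random

V, T, C, H, L = 27, 16, 16, 4, 1   # V = 27 is an assumed vocab size, not from the script
D = C // H                          # head_dim = 4

def matrix(n_out, n_in, scale=0.08):
    # an [n_out, n_in] matrix of small random floats
    return [[random.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]

state_dict = {
    "wte": matrix(V, C),       # token embedding table    [V, C]
    "wpe": matrix(T, C),       # position embedding table [T, C]
    "lm_head": matrix(V, C),   # hidden [C] -> logits [V]
}
for layer in range(L):
    state_dict[f"layer{layer}.attn_wq"] = matrix(C, C)
    state_dict[f"layer{layer}.attn_wk"] = matrix(C, C)
    state_dict[f"layer{layer}.attn_wv"] = matrix(C, C)
    state_dict[f"layer{layer}.attn_wo"] = matrix(C, C)
    state_dict[f"layer{layer}.mlp_fc1"] = matrix(4 * C, C)   # up-project
    state_dict[f"layer{layer}.mlp_fc2"] = matrix(C, 4 * C)   # down-project
```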
2) The forward pass at one time step (one pos_id)
The script is “token-by-token” (one position at a time), but it still behaves like a normal Transformer because it keeps a key/value cache.
2.1 Embeddings
- `tok_emb = wte[token_id]` → `[C]`
- `pos_emb = wpe[pos_id]` → `[C]`
- `x = tok_emb + pos_emb` → `[C]`
- `x = rmsnorm(x)` → `[C]`
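As a sketch of step 2.1 in plain Python (floats instead of the script's `Value` objects; the gain-free `rmsnorm` here is my simplification and may differ in detail from the script's version):

```python
import math

def rmsnorm(x, eps=1e-5):
    # rescale x so its root-mean-square is ~1 (no learned gain in this sketch)
    ms = sum(xi * xi for xi in x) / len(x)
    return [xi / math.sqrt(ms + eps) for xi in x]

def embed(state_dict, token_id, pos_id):
    tok_emb = state_dict["wte"][token_id]          # [C]
    pos_emb = state_dict["wpe"][pos_id]            # [C]
    x = [t + p for t, p in zip(tok_emb, pos_emb)]  # [C]
    return rmsnorm(x)                              # [C]
```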
2.2 Compute Q, K, V
Each is a linear projection [C] → [C]:
- `q = x @ Wq` → `[C]`
- `k = x @ Wk` → `[C]`
- `v = x @ Wv` → `[C]`
Now the cache grows with time.
- Before appending, `keys[layer]` has length `pos_id` (0-based).
- After `keys[layer].append(k)`, it has length `pos_id + 1`.
So at step pos_id = t, the cached keys/values are:
- `k_cache`: list of `t+1` vectors of shape `[C]` → conceptually `[t+1, C]`
- `v_cache`: list of `t+1` vectors of shape `[C]` → conceptually `[t+1, C]`
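A sketch of step 2.2 under the same assumptions; `matvec` applies an `[out, in]` matrix to an `[in]` vector, and the per-layer caches are plain lists that grow by one vector per step:

```python
def matvec(W, x):
    # W: [out, in], x: [in] -> [out]
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def qkv_and_cache(state_dict, x, layer, keys, values):
    q = matvec(state_dict[f"layer{layer}.attn_wq"], x)  # [C]
    k = matvec(state_dict[f"layer{layer}.attn_wk"], x)  # [C]
    v = matvec(state_dict[f"layer{layer}.attn_wv"], x)  # [C]
    keys[layer].append(k)    # cache now holds pos_id + 1 key vectors
    values[layer].append(v)  # ... and pos_id + 1 value vectors
    return q
```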
2.3 Split into heads
We slice the channel dimension into H heads.
- `q_h` → `[D]` for each head
- `k_h[t]` → `[D]` for each cached time `t`
- `v_h[t]` → `[D]` for each cached time `t`
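The slicing itself is just indexing; a tiny illustrative helper (not the script's code):

```python
def split_heads(vec, n_head):
    # slice a [C] vector into n_head chunks of size D = C // n_head
    D = len(vec) // n_head
    return [vec[h * D:(h + 1) * D] for h in range(n_head)]

# split_heads(q, H)[0] is q_h for head 0, shape [D]
```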
2.4 Attention math (per head)
For a fixed head:
- Attention logits, one for each cached time `t`:
  - dot product: `q_h · k_h[t]` → scalar
  - scale: divide by `sqrt(D)` → scalar

So you get a vector of logits: `attn_logits` → `[t+1]`
- Softmax: `attn_weights = softmax(attn_logits)` → `[t+1]`
- Weighted sum of values over the cached times `i = 0..t`: `head_out[j] = Σ_i attn_weights[i] * v_h[i][j]`
Result: `head_out` → `[D]`
Do this for all heads and concatenate: `x_attn` → `[C]`
Then output projection:
- `x = x_attn @ Wo` → `[C]`
- residual add: `x = x + x_residual` → `[C]`
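Putting 2.3 and 2.4 together, here is a sketch of one attention step across all heads, reusing the `matvec` and `split_heads` helpers from the sketches above (plain floats again, so no gradients):

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def attention(state_dict, q, keys, values, layer, n_head, x_residual):
    D = len(q) // n_head
    q_heads = split_heads(q, n_head)
    concat = []                                                   # will become [C]
    for h in range(n_head):
        q_h = q_heads[h]                                          # [D]
        k_h = [split_heads(k, n_head)[h] for k in keys[layer]]    # [t+1, D]
        v_h = [split_heads(v, n_head)[h] for v in values[layer]]  # [t+1, D]
        logits = [sum(qi * ki for qi, ki in zip(q_h, kt)) / math.sqrt(D)
                  for kt in k_h]                                  # [t+1]
        weights = softmax(logits)                                 # [t+1]
        head_out = [sum(w * vt[j] for w, vt in zip(weights, v_h))
                    for j in range(D)]                            # [D]
        concat.extend(head_out)
    x_attn = matvec(state_dict[f"layer{layer}.attn_wo"], concat)  # [C]
    return [a + r for a, r in zip(x_attn, x_residual)]            # residual add, [C]
```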
2.5 MLP block
- `x = rmsnorm(x)` → `[C]`
- `x = x @ fc1` → `[4C]`
- `x = relu(x)` → `[4C]`
- `x = x @ fc2` → `[C]`
- residual add → `[C]`
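A sketch of the MLP block, reusing `rmsnorm` and `matvec` from the earlier sketches:

```python
def mlp_block(state_dict, x, layer):
    x_residual = x
    h = rmsnorm(x)                                       # [C]
    h = matvec(state_dict[f"layer{layer}.mlp_fc1"], h)   # [4C]
    h = [max(0.0, hi) for hi in h]                       # relu, [4C]
    h = matvec(state_dict[f"layer{layer}.mlp_fc2"], h)   # [C]
    return [hi + ri for hi, ri in zip(h, x_residual)]    # residual add, [C]
```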
2.6 Logits
`logits = x @ lm_head` → `[V]`
Then:
- `probs = softmax(logits)` → `[V]`
- loss for the target token: `-log(probs[target_id])` → scalar
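And a sketch of the final projection and loss, reusing `matvec`, `softmax`, and `math` from the sketches above:

```python
def logits_and_loss(state_dict, x, target_id):
    logits = matvec(state_dict["lm_head"], x)   # [V]
    probs = softmax(logits)                     # [V]
    loss = -math.log(probs[target_id])          # scalar negative log-likelihood
    return probs, loss
```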
3) A tiny numeric toy forward pass (not the real model)
The real model uses C=16, H=4, etc. That’s too big for a “by hand” demo.
So here is a toy version with the same structure:
- `C = 4`, `H = 2`, `D = 2`
- we assume we are at time `t = 1` (so we have two cached tokens)
Setup
Let the query for one head be:
q_h = [1, 0]
Let cached keys be:
- `k_h[0] = [1, 0]`
- `k_h[1] = [0, 1]`
Dot products:
- `q·k[0] = 1`
- `q·k[1] = 0`
Scale by sqrt(D)=sqrt(2) ≈ 1.414:
- logits ≈ `[0.707, 0.0]`
Softmax:
- exp(0.707)=2.028, exp(0)=1.000
- sum=3.028
- weights ≈ `[0.67, 0.33]`
Let cached values be:
- `v_h[0] = [10, 0]`
- `v_h[1] = [0, 10]`
Weighted sum:
- output ≈ `0.67*[10,0] + 0.33*[0,10]`
- output ≈ `[6.7, 3.3]`
That’s what attention does: it mixes past value vectors based on similarity of query to keys.
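If you want to check the arithmetic, this short standalone script reproduces the toy numbers:

```python
import math

q_h = [1, 0]
k_h = [[1, 0], [0, 1]]
v_h = [[10, 0], [0, 10]]
D = 2

logits = [sum(qi * ki for qi, ki in zip(q_h, kt)) / math.sqrt(D) for kt in k_h]
exps = [math.exp(l) for l in logits]
weights = [e / sum(exps) for e in exps]
out = [sum(w * vt[j] for w, vt in zip(weights, v_h)) for j in range(D)]

print(logits)   # ~[0.707, 0.0]
print(weights)  # ~[0.67, 0.33]
print(out)      # ~[6.7, 3.3]
```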
4) What to implement first (if you’re rebuilding this)
If your goal is to reproduce microgpt.py, build in this order:
- Tokenizer + BOS
- `Value` autograd + `backward()`
- `linear`, `softmax`, `rmsnorm`
- single-head attention (get it working)
- multi-head split/concat
- MLP + residuals
- LM head + NLL loss
- Adam update
- sampling loop
Once this works, optimize with NumPy/PyTorch.