Understanding deep learning by building it from scratch
Using PyTorch and calling .backward() is powerful, but it hides everything important. How does autograd actually compute gradients? What is a computational graph? How does a Transformer generate text token by token? This project answers those questions by building every component from scratch - no autograd, no nn.Module, just raw computation.
Computational Graph + Reverse-Mode Autodiff
Models computation as a graph of nodes (op + inputs + attributes). Builds the graph first, computes values lazily via an Evaluator - similar to TensorFlow v1 static graph execution.
Computational Graph Design
Each node contains an operation, its inputs, and attributes. Nothing is computed until the Evaluator.run() call traverses the graph in topological order - lazy evaluation as a first-class design choice.
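The graph-then-evaluate idea can be sketched in a few lines. This is a minimal illustration, not the project's exact API: the `Node` fields and the `Evaluator.run()` name mirror the description above, but the op registry is a hypothetical stand-in.

```python
from dataclasses import dataclass, field

# Illustrative sketch: a node is just (op, inputs, attributes);
# nothing is computed at construction time.
@dataclass
class Node:
    op: str                                    # e.g. "add", "mul", "const"
    inputs: tuple = ()                         # upstream Node objects
    attrs: dict = field(default_factory=dict)  # e.g. a constant's value

def topo_order(root):
    """Post-order DFS: every node appears after all of its inputs."""
    order, seen = [], set()
    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for i in n.inputs:
            visit(i)
        order.append(n)
    visit(root)
    return order

class Evaluator:
    OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

    def run(self, root):
        values = {}
        for n in topo_order(root):       # lazy: nothing runs until here
            if n.op == "const":
                values[id(n)] = n.attrs["value"]
            else:
                args = [values[id(i)] for i in n.inputs]
                values[id(n)] = self.OPS[n.op](*args)
        return values[id(root)]

# build the graph first, evaluate later
x = Node("const", attrs={"value": 2.0})
y = Node("const", attrs={"value": 3.0})
z = Node("mul", (Node("add", (x, y)), y))    # (x + y) * y
assert Evaluator().run(z) == 15.0
```

Separating graph construction from evaluation is exactly what makes the backward pass possible later: the full structure of the computation is available as data before any value exists.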
Reverse-Mode Autodiff
gradients() constructs the backward graph. Each operator defines compute() for forward and gradient() for the backward rule. Gradients flow from loss back through every node in reverse topological order.
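The core of `gradients()` can be shown on scalars. This is a hedged sketch under illustrative assumptions (the `GRAD` registry and `Node` shape are stand-ins for the project's per-operator `gradient()` methods): adjoints start at dL/dL = 1 and are accumulated into each input in reverse topological order.

```python
# Sketch of reverse-mode autodiff over a tiny scalar graph.
class Node:
    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, tuple(inputs), value

# Gradient rule per op: (node, upstream_grad) -> grads w.r.t. each input
GRAD = {
    "add": lambda n, g: [g, g],
    "mul": lambda n, g: [g * n.inputs[1].value, g * n.inputs[0].value],
}

def topo(root):
    order, seen = [], set()
    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for i in n.inputs:
            visit(i)
        order.append(n)
    visit(root)
    return order

def gradients(loss, wrt):
    adjoint = {id(loss): 1.0}              # dL/dL = 1
    for n in reversed(topo(loss)):         # reverse topological order
        g = adjoint.get(id(n), 0.0)
        if n.op in GRAD:
            for inp, gi in zip(n.inputs, GRAD[n.op](n, g)):
                adjoint[id(inp)] = adjoint.get(id(inp), 0.0) + gi
    return [adjoint.get(id(w), 0.0) for w in wrt]

# f(x, y) = (x + y) * y at x=2, y=3  ->  df/dx = y = 3, df/dy = x + 2y = 8
x, y = Node("leaf", value=2.0), Node("leaf", value=3.0)
s = Node("add", (x, y), value=x.value + y.value)
f = Node("mul", (s, y), value=s.value * y.value)
assert gradients(f, [x, y]) == [3.0, 8.0]
```

Note the `+=` accumulation: `y` feeds two downstream nodes, so its adjoint is the sum of both contributions, which is where ad-hoc per-op backprop implementations most often go wrong.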
8 Operators Implemented
Division, Power, Mean, MatMul, Transpose cover the core arithmetic; ReLU, Softmax, LayerNorm cover the neural ops. Both forward and backward rules are implemented for all eight. LayerNorm's backward is the trickiest: the gradient through the normalization denominator couples every element of a row.
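To make the LayerNorm claim concrete, here is a sketch of the forward and backward rules (numpy instead of the project's torch tensors, no affine scale/shift), checked against finite differences. The backward formula shown is the standard one; the function names are illustrative.

```python
import numpy as np

def layernorm_forward(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    inv_std = 1.0 / np.sqrt(var + eps)
    y = (x - mu) * inv_std
    return y, inv_std

def layernorm_backward(dy, y, inv_std):
    # dx = inv_std * (dy - mean(dy) - y * mean(dy * y))
    # The two mean terms come from the mean path and the variance
    # (denominator) path: every input affects every output in its row.
    return inv_std * (dy - dy.mean(axis=-1, keepdims=True)
                      - y * (dy * y).mean(axis=-1, keepdims=True))

# Numeric check against central finite differences
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4))
dy = rng.standard_normal((2, 4))
y, inv_std = layernorm_forward(x)
dx = layernorm_backward(dy, y, inv_std)

num, h = np.zeros_like(x), 1e-6
for idx in np.ndindex(x.shape):
    xp, xm = x.copy(), x.copy()
    xp[idx] += h
    xm[idx] -= h
    num[idx] = ((layernorm_forward(xp)[0] * dy).sum()
                - (layernorm_forward(xm)[0] * dy).sum()) / (2 * h)
assert np.allclose(dx, num, atol=1e-4)
```

The finite-difference check is the same technique one would use to validate any hand-written backward rule before trusting it in training.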
Decoder-Only Transformer + Autoregressive Generation
Causal Self-Attention
Q = XW_Q, K = XW_K, V = XW_V; scores = QKᵀ/√d_k + mask; output = Softmax(scores)·V·W_O. The causal mask fills future positions with −1e9. Everything is computed through the custom autodiff engine - gradients flow automatically.
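The effect of the −1e9 mask is easy to demonstrate numerically. A tiny illustrative check (numpy, uniform scores for clarity): after softmax, every weight above the diagonal is effectively zero.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T = 4
scores = np.zeros((T, T))                   # uniform scores, for clarity
mask = np.triu(np.full((T, T), -1e9), k=1)  # -1e9 above the diagonal
attn = softmax(scores + mask)

assert np.allclose(attn[0], [1, 0, 0, 0])   # token 0 sees only itself
assert np.allclose(attn[3], [0.25] * 4)     # token 3 sees all four tokens
assert np.allclose(np.triu(attn, k=1), 0)   # no future-token leakage
```

Using a large negative additive constant (rather than zeroing weights after softmax) keeps the operation differentiable and keeps each row a valid probability distribution.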
Full Transformer Block
Token embedding → positional encoding → self-attention → residual+LayerNorm → FFN(ReLU) → residual+LayerNorm → logits. Standard GPT-style decoder block.
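The block's data flow can be sketched end to end. This is a shape-level illustration (numpy, square weight matrices, inputs assumed already embedded with positions added); the real project routes every one of these ops through its custom graph instead.

```python
import numpy as np

def layernorm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def decoder_block(x, p):
    """One GPT-style block: causal attention + FFN, each with residual + LN."""
    T, d = x.shape
    Q, K, V = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    mask = np.triu(np.full((T, T), -1e9), k=1)
    attn = softmax(Q @ K.T / np.sqrt(d) + mask) @ V @ p["Wo"]
    x = layernorm(x + attn)                        # residual + LayerNorm
    ffn = np.maximum(x @ p["W1"], 0) @ p["W2"]     # FFN with ReLU
    return layernorm(x + ffn)                      # residual + LayerNorm

rng = np.random.default_rng(0)
d, T = 8, 5
p = {k: rng.standard_normal((d, d)) * 0.1
     for k in ["Wq", "Wk", "Wv", "Wo", "W1", "W2"]}
out = decoder_block(rng.standard_normal((T, d)), p)
assert out.shape == (T, d)
```

In a full model, `out` would feed the next block or the final projection to vocabulary logits.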
Training + Generation
SGD implemented manually: gradients are summed across the batch dimension before each weight update. Autoregressive decoding: argmax the next token, append it, repeat the full forward pass. No KV cache - the resulting O(T³) complexity is analyzed as a bonus insight.
```python
# Causal attention - all via custom autodiff, no PyTorch autograd
scores = matmul(Q, transpose(K)) / sqrt(d_k)
scores = add(scores, causal_mask)        # mask future tokens with -1e9
attn = softmax(scores)                   # custom Softmax op
output = matmul(matmul(attn, V), W_o)

# Loss + backward - gradients flow back through entire graph
loss = cross_entropy(logits, targets)
grads = gradients(loss, params)          # reverse-mode autodiff
for param, grad in zip(params, grads):
    param -= lr * grad.sum(dim=0)        # manual SGD
```
Technologies Used
| Component | Tool & Purpose |
|---|---|
| Language | Python - all implementation |
| Tensors | PyTorch tensors - data containers only, autograd completely disabled |
| AutoDiff | Custom reverse-mode - computational graph + topological evaluation |
| Model | Decoder-only Transformer - GPT-style: embedding, attention, FFN, LayerNorm |
| Training | Custom SGD - manual gradient computation, batch-summed weight update |
| Generation | Autoregressive decoding - argmax, append, full forward pass per step (O(T³)) |
Results
- Backpropagation is just reverse graph traversal - demonstrated by implementing it from scratch
- Transformer architecture decomposed into 8 simple matrix operation primitives
- Causal mask correctly prevents future token leakage during training and generation
- Naive decoding O(T³) vs KV cache O(T²) - tradeoff clearly understood and documented
What I took away
- Autograd is a graph problem - building the forward graph first, then differentiating it, is more elegant than implementing gradients ad hoc per operation.
- LayerNorm's backward pass was the hardest operator to get right - the gradient through the normalization denominator mixes every element of a row via the mean and variance terms.
- Transformers are not magic - they are compositions of MatMul, Softmax, LayerNorm, and residual connections, each with well-understood gradients.
- Autoregressive generation without KV cache recomputes full attention every step - O(T³) cost becomes obvious at sequence length 50+.