AutoDiff Engine + Transformer LM

Understanding deep learning by building it from scratch

Using PyTorch and calling .backward() is powerful, but it hides everything important. How does autograd actually compute gradients? What is a computational graph? How does a Transformer generate text token by token? This project answers those questions by building every component from scratch - no autograd, no nn.Module, just raw computation.

Computational Graph + Reverse-Mode Autodiff

Models computation as a graph of nodes (op + inputs + attributes). Builds the graph first, computes values lazily via an Evaluator - similar to TensorFlow v1 static graph execution.

Computational Graph Design

Each node contains an operation, its inputs, and attributes. Nothing is computed until the Evaluator.run() call traverses the graph in topological order - lazy evaluation as a first-class design choice.
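
A minimal sketch of how such a node and evaluator could fit together. The class names, the topo_sort helper, and the feed-dict interface below are illustrative assumptions, not the project's exact API:

# Illustrative sketch - names and signatures are assumptions, not the project's API
class Node:
    def __init__(self, op, inputs, **attrs):
        self.op = op          # operator object, e.g. MatMul() or Softmax()
        self.inputs = inputs  # parent Nodes this node depends on
        self.attrs = attrs    # extra attributes, e.g. the axis for Mean

def topo_sort(node, visited=None, order=None):
    # depth-first post-order: every parent appears before its consumers
    visited = set() if visited is None else visited
    order = [] if order is None else order
    if node not in visited:
        visited.add(node)
        for parent in node.inputs:
            topo_sort(parent, visited, order)
        order.append(node)
    return order

class Evaluator:
    def run(self, output_node, feed):
        values = dict(feed)                  # leaf Node -> concrete tensor
        for node in topo_sort(output_node):  # nothing runs until this call
            if node not in values:
                in_vals = [values[p] for p in node.inputs]
                values[node] = node.op.compute(node, in_vals)
        return values[output_node]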

Reverse-Mode Autodiff

gradients() constructs the backward graph. Each operator defines compute() for forward and gradient() for the backward rule. Gradients flow from loss back through every node in reverse topological order.
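
A sketch of how a single operator and the backward construction could look, under the same assumed Node/topo_sort interface as above. Here matmul, transpose, and add stand for the engine's own graph-building ops, ones_like is an assumed helper, and the real signatures may differ:

# Sketch only - the real operator signatures and gradients() internals may differ
class MatMul:
    def compute(self, node, in_vals):
        A, B = in_vals
        return A @ B                          # forward rule on concrete tensors

    def gradient(self, node, out_grad):
        # d(AB)/dA = out_grad · Bᵀ,  d(AB)/dB = Aᵀ · out_grad
        A, B = node.inputs
        return [matmul(out_grad, transpose(B)),
                matmul(transpose(A), out_grad)]

def gradients(loss, params):
    # Walk the forward graph in reverse topological order, asking each operator
    # for its backward rule and accumulating the gradient nodes per parent.
    grad_map = {loss: ones_like(loss)}        # d(loss)/d(loss) = 1
    for node in reversed(topo_sort(loss)):
        out_grad = grad_map[node]
        for parent, g in zip(node.inputs, node.op.gradient(node, out_grad)):
            grad_map[parent] = g if parent not in grad_map else add(grad_map[parent], g)
    return [grad_map[p] for p in params]

Because gradient() returns graph nodes rather than values, the result is itself a graph - the backward pass is just more nodes evaluated by the same Evaluator.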

8 Operators Implemented

Division, Power, Mean, MatMul, Transpose - core arithmetic. ReLU, Softmax, LayerNorm - neural ops. Both forward and backward are implemented for all eight. LayerNorm's backward is the hardest: the gradient flowing through the normalization denominator needs careful handling.
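
To make the LayerNorm claim concrete, here is the backward rule for a LayerNorm without affine scale/shift, written directly over PyTorch tensors used as plain containers. This is a sketch of the math rather than the project's exact code:

# Sketch of the LayerNorm backward rule (no affine scale/shift); PyTorch tensors
# are used purely as data containers, as in the project
import torch

def layernorm_backward(x, dy, eps=1e-5):
    mu = x.mean(dim=-1, keepdim=True)
    var = ((x - mu) ** 2).mean(dim=-1, keepdim=True)
    s = torch.sqrt(var + eps)                 # the normalization denominator
    y = (x - mu) / s
    # dL/dx = (dy - mean(dy) - y * mean(dy * y)) / s
    # The y * mean(dy * y) term comes from differentiating through s itself,
    # which is what makes this gradient easy to get subtly wrong.
    return (dy
            - dy.mean(dim=-1, keepdim=True)
            - y * (dy * y).mean(dim=-1, keepdim=True)) / s

A finite-difference check against this closed form is a cheap way to catch sign and broadcasting mistakes.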

Decoder-Only Transformer + Autoregressive Generation

Causal Self-Attention

Q = XW_Q, K = XW_K, V = XW_V; scores = QKᵀ/√d_k + mask; output = Softmax(scores)·V·W_o. Causal mask uses −1e9. All computed through the custom autodiff engine - gradients flow automatically.
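
One straightforward way to build that additive mask (a sketch; the project may construct it differently):

# Additive causal mask: 0 on and below the diagonal, -1e9 above it, so the
# Softmax assigns ~zero probability to future positions
import torch

def causal_mask(T):
    future = torch.triu(torch.ones(T, T), diagonal=1).bool()  # True where column > row
    mask = torch.zeros(T, T)
    mask[future] = -1e9
    return mask

Adding this matrix to the raw scores before the Softmax matches the add(scores, causal_mask) line in the snippet further below.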

Full Transformer Block

Token embedding → positional encoding → self-attention → residual+LayerNorm → FFN(ReLU) → residual+LayerNorm → logits. Standard GPT-style decoder block.
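
In the same pseudocode style as the snippet below, the whole block composes into a few lines. The helpers embed, positional_encoding, attention, and ffn, as well as the parameter names, are illustrative assumptions; layer_norm, add, and matmul stand for the engine's graph ops:

# Pseudocode sketch of the forward pass; helper and parameter names are illustrative
def forward(tokens, params):
    x = add(embed(tokens, params.W_emb), positional_encoding(tokens))
    x = layer_norm(add(x, attention(x, params)))  # causal self-attention + residual + LayerNorm
    x = layer_norm(add(x, ffn(x, params)))        # ReLU feed-forward + residual + LayerNorm
    return matmul(x, params.W_out)                # project to vocabulary logits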

Training + Generation

SGD implemented manually. Gradients are summed across the batch dimension before each weight update. Autoregressive decoding: argmax the next token, append it, repeat the full forward pass. No KV cache - the resulting O(T³) complexity is analyzed as a bonus insight.

# Causal attention  -  all via custom autodiff, no PyTorch autograd
scores = matmul(Q, transpose(K)) / sqrt(d_k)
scores = add(scores, causal_mask)   # mask future tokens with -1e9
attn   = softmax(scores)            # custom Softmax op
output = matmul(matmul(attn, V), W_o)

# Loss + backward  -  gradients flow back through entire graph
loss = cross_entropy(logits, targets)
grads = gradients(loss, params)     # reverse-mode autodiff
for param, grad in zip(params, grads):
    param -= lr * grad.sum(dim=0)   # manual SGD
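
And a sketch of the naive decoding loop described above, reusing the assumed forward helper from the block sketch; evaluating the logits node through the Evaluator is elided for brevity:

# Naive autoregressive decoding - the full forward pass is recomputed for every
# new token, which is exactly where the O(T^3) total cost comes from
def generate(prompt_ids, num_new_tokens, params):
    ids = list(prompt_ids)
    for _ in range(num_new_tokens):
        logits = forward(ids, params)        # O(T^2) attention over the whole prefix
        next_id = int(logits[-1].argmax())   # greedy: pick the most likely next token
        ids.append(next_id)                  # append and repeat
    return ids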

Technologies Used

Component     Tool & Purpose
Language      Python - all implementation
Tensors       PyTorch tensors - data containers only, autograd completely disabled
AutoDiff      Custom reverse-mode - computational graph + topological evaluation
Model         Decoder-only Transformer - GPT-style: embedding, attention, FFN, LayerNorm
Training      Custom SGD - manual gradient computation, batch-summed weight update
Generation    Autoregressive decoding - argmax, append, full forward pass per step (O(T³))

Results

  • 8/10 training sentences memorized after training
  • 0 lines of PyTorch autograd used
  • 8 operators with full forward + backward
  • Backpropagation is just reverse graph traversal - proved by implementing it from scratch
  • Transformer architecture decomposed into 8 simple matrix operation primitives
  • Causal mask correctly prevents future token leakage during training and generation
  • Naive decoding O(T³) vs KV cache O(T²) - tradeoff clearly understood and documented

What I took away

  • Autograd is a graph problem - building the forward graph first, then differentiating it, is more elegant than implementing gradients ad hoc per operation.
  • LayerNorm backward is the hardest operator to get right - careful handling of the normalization denominator gradient is essential.
  • Transformers are not magic - they are compositions of MatMul, Softmax, LayerNorm, and residual connections, each with well-understood gradients.
  • Autoregressive generation without a KV cache recomputes full attention at every step - each of the T steps costs O(T²), so generation is O(T³) overall, and the cost becomes obvious at sequence lengths of 50+.