AutoDiff Engine + Transformer LM

Understanding deep learning by building it from scratch

Using PyTorch and calling .backward() is powerful, but it hides everything important. How does autograd actually compute gradients? What is a computational graph? How does a Transformer generate text token by token? This project answers those questions by building every component from scratch - no autograd, no nn.Module, just raw computation.

Computational Graph + Reverse-Mode Autodiff

Models computation as a graph of nodes (op + inputs + attributes). Builds the graph first, computes values lazily via an Evaluator - similar to TensorFlow v1 static graph execution.

Computational Graph Design

Each node contains an operation, its inputs, and attributes. Nothing is computed until the Evaluator.run() call traverses the graph in topological order - lazy evaluation as a first-class design choice.
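
A minimal sketch of how such a node and evaluator could fit together. The class names, the topo_sort helper, and the feed-dict interface below are illustrative assumptions, not the project's exact API:

# Illustrative sketch - names and signatures are assumptions, not the project's API
class Node:
    def __init__(self, op, inputs, **attrs):
        self.op = op          # operator object, e.g. MatMul() or Softmax()
        self.inputs = inputs  # parent Nodes this node depends on
        self.attrs = attrs    # extra attributes, e.g. the axis for Mean

def topo_sort(node, visited=None, order=None):
    # depth-first post-order: every parent appears before its consumers
    visited = set() if visited is None else visited
    order = [] if order is None else order
    if node not in visited:
        visited.add(node)
        for parent in node.inputs:
            topo_sort(parent, visited, order)
        order.append(node)
    return order

class Evaluator:
    def run(self, output_node, feed):
        values = dict(feed)                  # leaf Node -> concrete tensor
        for node in topo_sort(output_node):  # nothing runs until this call
            if node not in values:
                in_vals = [values[p] for p in node.inputs]
                values[node] = node.op.compute(node, in_vals)
        return values[output_node]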

Reverse-Mode Autodiff

gradients() constructs the backward graph. Each operator defines compute() for forward and gradient() for the backward rule. Gradients flow from loss back through every node in reverse topological order.
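
A sketch of how a single operator and the backward construction could look, under the same assumed Node/topo_sort interface as above. Here matmul, transpose, and add stand for the engine's own graph-building ops, ones_like is an assumed helper, and the real signatures may differ:

# Sketch only - the real operator signatures and gradients() internals may differ
class MatMul:
    def compute(self, node, in_vals):
        A, B = in_vals
        return A @ B                          # forward rule on concrete tensors

    def gradient(self, node, out_grad):
        # d(AB)/dA = out_grad · Bᵀ,  d(AB)/dB = Aᵀ · out_grad
        A, B = node.inputs
        return [matmul(out_grad, transpose(B)),
                matmul(transpose(A), out_grad)]

def gradients(loss, params):
    # Walk the forward graph in reverse topological order, asking each operator
    # for its backward rule and accumulating the gradient nodes per parent.
    grad_map = {loss: ones_like(loss)}        # d(loss)/d(loss) = 1
    for node in reversed(topo_sort(loss)):
        out_grad = grad_map[node]
        for parent, g in zip(node.inputs, node.op.gradient(node, out_grad)):
            grad_map[parent] = g if parent not in grad_map else add(grad_map[parent], g)
    return [grad_map[p] for p in params]

Because gradient() returns graph nodes rather than values, the result is itself a graph - the backward pass is just more nodes evaluated by the same Evaluator.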

8 Operators Implemented

Division, Power, Mean, MatMul, Transpose - core arithmetic. ReLU, Softmax, LayerNorm - neural ops. Both forward and backward are implemented for all eight. LayerNorm's backward is the hardest: the gradient flowing through the normalization denominator needs careful handling.
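
To make the LayerNorm claim concrete, here is the backward rule for a LayerNorm without affine scale/shift, written directly over PyTorch tensors used as plain containers. This is a sketch of the math rather than the project's exact code:

# Sketch of the LayerNorm backward rule (no affine scale/shift); PyTorch tensors
# are used purely as data containers, as in the project
import torch

def layernorm_backward(x, dy, eps=1e-5):
    mu = x.mean(dim=-1, keepdim=True)
    var = ((x - mu) ** 2).mean(dim=-1, keepdim=True)
    s = torch.sqrt(var + eps)                 # the normalization denominator
    y = (x - mu) / s
    # dL/dx = (dy - mean(dy) - y * mean(dy * y)) / s
    # The y * mean(dy * y) term comes from differentiating through s itself,
    # which is what makes this gradient easy to get subtly wrong.
    return (dy
            - dy.mean(dim=-1, keepdim=True)
            - y * (dy * y).mean(dim=-1, keepdim=True)) / s

A finite-difference check against this closed form is a cheap way to catch sign and broadcasting mistakes.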

Decoder-Only Transformer + Autoregressive Generation

Causal Self-Attention

Q = XW_Q, K = XW_K, V = XW_V; scores = QKᵀ/√d_k + mask; output = Softmax(scores)·V·W_o. Causal mask uses −1e9. All computed through the custom autodiff engine - gradients flow automatically.
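
One straightforward way to build that additive mask (a sketch; the project may construct it differently):

# Additive causal mask: 0 on and below the diagonal, -1e9 above it, so the
# Softmax assigns ~zero probability to future positions
import torch

def causal_mask(T):
    future = torch.triu(torch.ones(T, T), diagonal=1).bool()  # True where column > row
    mask = torch.zeros(T, T)
    mask[future] = -1e9
    return mask

Adding this matrix to the raw scores before the Softmax matches the add(scores, causal_mask) line in the snippet further below.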

Full Transformer Block

Token embedding → positional encoding → self-attention → residual+LayerNorm → FFN(ReLU) → residual+LayerNorm → logits. Standard GPT-style decoder block.
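
In the same pseudocode style as the snippet below, the whole block composes into a few lines. The helpers embed, positional_encoding, attention, and ffn, as well as the parameter names, are illustrative assumptions; layer_norm, add, and matmul stand for the engine's graph ops:

# Pseudocode sketch of the forward pass; helper and parameter names are illustrative
def forward(tokens, params):
    x = add(embed(tokens, params.W_emb), positional_encoding(tokens))
    x = layer_norm(add(x, attention(x, params)))  # causal self-attention + residual + LayerNorm
    x = layer_norm(add(x, ffn(x, params)))        # ReLU feed-forward + residual + LayerNorm
    return matmul(x, params.W_out)                # project to vocabulary logits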

Training + Generation

SGD implemented manually. Gradients are summed across the batch dimension before each weight update. Autoregressive decoding: argmax the next token, append it, repeat the full forward pass. No KV cache - the resulting O(T³) complexity is analyzed as a bonus insight.

# Causal attention  -  all via custom autodiff, no PyTorch autograd
scores = matmul(Q, transpose(K)) / sqrt(d_k)
scores = add(scores, causal_mask)   # mask future tokens with -1e9
attn   = softmax(scores)            # custom Softmax op
output = matmul(matmul(attn, V), W_o)

# Loss + backward  -  gradients flow back through entire graph
loss = cross_entropy(logits, targets)
grads = gradients(loss, params)     # reverse-mode autodiff
for param, grad in zip(params, grads):
    param -= lr * grad.sum(dim=0)   # manual SGD
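
And a sketch of the naive decoding loop described above, reusing the assumed forward helper from the block sketch; evaluating the logits node through the Evaluator is elided for brevity:

# Naive autoregressive decoding - the full forward pass is recomputed for every
# new token, which is exactly where the O(T^3) total cost comes from
def generate(prompt_ids, num_new_tokens, params):
    ids = list(prompt_ids)
    for _ in range(num_new_tokens):
        logits = forward(ids, params)        # O(T^2) attention over the whole prefix
        next_id = int(logits[-1].argmax())   # greedy: pick the most likely next token
        ids.append(next_id)                  # append and repeat
    return ids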

Technologies Used

Component     Tool & Purpose
Language      Python - all implementation
Tensors       PyTorch tensors - data containers only, autograd completely disabled
AutoDiff      Custom reverse-mode - computational graph + topological evaluation
Model         Decoder-only Transformer - GPT-style: embedding, attention, FFN, LayerNorm
Training      Custom SGD - manual gradient computation, batch-summed weight update
Generation    Autoregressive decoding - argmax, append, full forward pass per step (O(T³))

Results

  • 8/10 training sentences memorized after training
  • 0 lines of PyTorch autograd used
  • 8 operators with full forward + backward
  • Backpropagation is just reverse graph traversal - proved by implementing it from scratch
  • Transformer architecture decomposed into 8 simple matrix operation primitives
  • Causal mask correctly prevents future token leakage during training and generation
  • Naive decoding O(T³) vs KV cache O(T²) - tradeoff clearly understood and documented

What I took away

  • Autograd is a graph problem - building the forward graph first, then differentiating it, is more elegant than implementing gradients ad hoc per operation.
  • LayerNorm backward is the hardest operator to get right - careful handling of the normalization denominator gradient is essential.
  • Transformers are not magic - they are compositions of MatMul, Softmax, LayerNorm, and residual connections, each with well-understood gradients.
  • Autoregressive generation without a KV cache recomputes full attention at every step - each of the T steps costs O(T²), so generation is O(T³) overall, and the cost becomes obvious at sequence lengths of 50+.