The chain rule & backpropagation
Neural networks are functions wrapped inside functions, dozens deep. To train them, you need the derivative of the whole stack with respect to every buried parameter. The chain rule makes this possible — and backpropagation is the chain rule, organized.
Build the intuition
The chain rule: rates multiply
If y depends on u, and u depends on x, then a nudge in x ripples through u to y — and the rates multiply: dy/dx = (dy/du)(du/dx). Like gears: turn the input gear, and the output gear's speed is the product of the ratios along the chain. For composed functions, sensitivities compound by multiplication.
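To make "the rates multiply" concrete, here is a minimal sketch that checks the chain rule against a numerical slope. The functions u = x² and y = sin(u), and the test point 0.7, are illustrative choices and not part of the lesson.

```python
import math

def u_of_x(x):
    return x * x            # inner function: u = x^2 (illustrative choice)

def y_of_u(u):
    return math.sin(u)      # outer function: y = sin(u) (illustrative choice)

def chain_rule_derivative(x):
    du_dx = 2 * x                   # local rate of the inner function
    dy_du = math.cos(u_of_x(x))     # local rate of the outer function, evaluated at u
    return dy_du * du_dx            # the rates multiply

def finite_difference(x, h=1e-6):
    # Central-difference estimate of dy/dx, used only as a sanity check
    return (y_of_u(u_of_x(x + h)) - y_of_u(u_of_x(x - h))) / (2 * h)

x0 = 0.7
analytic = chain_rule_derivative(x0)
numeric = finite_difference(x0)
print(analytic, numeric)    # the two values should agree to several decimal places
```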
Deep nesting, same rule
A network computes f(g(h(x))) — layer after layer. The chain rule extends straight through: the derivative of the whole is the product of each layer's local derivative. Each layer only needs to know how to differentiate itself; the chain rule stitches the local pieces into the global gradient. Depth is just a longer product.
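The "longer product" idea can be sketched as a loop over layers, each supplying only its own function and its own derivative. The three layers below are hypothetical stand-ins for h, g, and f. For simplicity this sketch multiplies the local derivatives in forward order as it evaluates the stack. Backpropagation builds the same product starting from the output end.

```python
import math

# Each layer is a pair (function, derivative of that function).
# These three layers are hypothetical examples, not taken from the lesson.
layers = [
    (lambda v: 2.0 * v + 1.0, lambda v: 2.0),                        # h
    (lambda v: v * v,         lambda v: 2.0 * v),                    # g
    (math.tanh,               lambda v: 1.0 - math.tanh(v) ** 2),    # f
]

def value_and_derivative(x):
    """Evaluate f(g(h(x))) and its derivative with respect to x."""
    value = x
    derivative = 1.0
    for fn, dfn in layers:
        derivative *= dfn(value)   # local derivative, evaluated at this layer's input
        value = fn(value)          # pass the output on to the next layer
    return value, derivative

print(value_and_derivative(0.1))
```

Adding a fourth layer means appending one more pair to the list. The loop itself does not change.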
Backpropagation: the chain rule, reversed and reused
Computed naively, the chain rule repeats enormous work. Backpropagation runs it backward, from the loss toward the inputs, caching shared factors so each is computed once. The result: the gradient with respect to every parameter, at a cost within a small constant factor of a single forward pass. It's the same compute-once-and-reuse cleverness behind the FFT, applied to derivatives. This efficiency is what makes training billion-parameter models practical.
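The backward sweep with reuse can be sketched on a toy scalar network. Everything here is an assumption made for illustration: the layer form a_i = tanh(w_i · a_{i-1}), the squared-error loss, and the names `forward`, `backward`, and `upstream`. The point is that one running value, `upstream`, carries the shared factor from the loss back through the layers, so each weight's gradient costs only a couple of multiplications.

```python
import math

def forward(weights, x):
    """Run a_i = tanh(w_i * a_{i-1}) and cache every activation."""
    activations = [x]
    for w in weights:
        activations.append(math.tanh(w * activations[-1]))
    return activations

def loss_fn(weights, x, target):
    # Toy squared-error loss on the final activation
    return 0.5 * (forward(weights, x)[-1] - target) ** 2

def backward(weights, x, target):
    """Return dLoss/dw_i for every weight using one backward sweep."""
    activations = forward(weights, x)
    grads = [0.0] * len(weights)
    upstream = activations[-1] - target          # dLoss/d(final activation)
    for i in reversed(range(len(weights))):
        a_in, a_out = activations[i], activations[i + 1]
        dz = upstream * (1.0 - a_out ** 2)       # back through tanh, using the cached output
        grads[i] = dz * a_in                     # d(w_i * a_in)/dw_i = a_in
        upstream = dz * weights[i]               # shared factor reused by all earlier layers
    return grads

weights = [0.5, -1.2, 0.8]
print(backward(weights, x=0.9, target=0.3))
```

A naive alternative would recompute the full product of local derivatives separately for each weight. Here the factors shared by earlier layers are computed once and passed along.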
See it move
Composition made visible: an input feeds an inner rule, whose output feeds an outer rule. The chain rule says the overall sensitivity is the product of the sensitivities at each stage.
A worked example
Differentiate a composition
A tiny network: y = (3x + 1)². Let u = 3x + 1, so y = u².
Local rates: dy/du = 2u, and du/dx = 3.
Multiply along the chain: dy/dx = (dy/du)(du/dx) = 2u · 3 = 6u = 6(3x + 1).
At x = 1: u = 4, so dy/dx = 6(4) = 24. Backprop does exactly this for every weight, layer by layer, reusing the shared pieces.
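The worked example above can be checked in a few lines. This is a minimal sketch; the function names `forward` and `dy_dx` are chosen here for illustration.

```python
def forward(x):
    u = 3 * x + 1      # inner function
    y = u ** 2         # outer function
    return u, y

def dy_dx(x):
    u, _ = forward(x)  # reuse the intermediate value from the forward pass
    dy_du = 2 * u      # local rate of the outer function
    du_dx = 3          # local rate of the inner function
    return dy_du * du_dx

print(dy_dx(1))  # prints 24
```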
Out in the world
Why GPUs train networks
Backpropagation reduces largely to big matrix multiplications: each layer's local derivatives, chained backward. GPUs are built to do that kind of arithmetic in parallel. The chain rule, plus linear algebra, plus fast hardware, is the engine behind modern deep learning. Remove any one and training at today's scale stops.
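To see where the matrix products come from, here is a minimal sketch of the backward step for one linear layer y = Wx. It assumes no bias and no activation, and uses plain Python lists in place of a GPU library. The toy loss L = ½ Σ yᵢ² and all names are illustrative assumptions. Given the gradient arriving from above (dL/dy), the layer's weight gradient is an outer product, and the gradient passed further back is the transpose of W times dL/dy.

```python
def matvec(W, x):
    # y[i] = sum_j W[i][j] * x[j]
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def linear_backward(W, x, grad_y):
    """Backward step for y = W x, given grad_y = dL/dy."""
    # dL/dW[i][j] = grad_y[i] * x[j]   (outer product)
    grad_W = [[g_i * x_j for x_j in x] for g_i in grad_y]
    # dL/dx[j] = sum_i W[i][j] * grad_y[i]   (W transposed times grad_y)
    grad_x = [sum(W[i][j] * grad_y[i] for i in range(len(W)))
              for j in range(len(x))]
    return grad_W, grad_x

W = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
x = [0.5, -2.0]
y = matvec(W, x)

# Toy loss L = 0.5 * sum(y_i ** 2), so dL/dy is simply y
grad_W, grad_x = linear_backward(W, x, y)
print(grad_W)
print(grad_x)
```

With a batch of inputs, both steps become matrix-matrix products, which is the arithmetic GPUs parallelize well.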
Common confusion, cleared
“Backpropagation is a special new kind of math.”
It's the chain rule from calculus, organized to avoid redundant work. The genius is in the bookkeeping (compute each shared factor once), not in any new derivative rule.
“Deeper networks need fundamentally harder derivatives.”
Each layer's derivative is simple; depth just means a longer product. The chain rule handles ten layers exactly as it handles two — more factors, same rule.
Check yourself
Quick check
For y = (3x + 1)², dy/dx at x = 1 is…
Backpropagation is best described as…
Recap
- Chain rule: for composed functions, multiply the rates along the chain.
- A deep network's gradient is the product of the layers' local derivatives.
- Backpropagation is the chain rule run backward with reuse — the engine of deep learning.