LearnMathora

Optimization & gradient descent · 03 · Math for ML · 9 min

Hard

The chain rule & backpropagation

Neural networks are functions wrapped inside functions, dozens deep. To train them, you need the derivative of the whole stack with respect to every buried parameter. The chain rule makes this possible — and backpropagation is the chain rule, organized.

Build the intuition

The chain rule: rates multiply

If y depends on u, and u depends on x, then a nudge in x ripples through u to y — and the rates multiply: dy/dx = (dy/du)(du/dx). Like gears: turn the input gear, and the output gear's speed is the product of the ratios along the chain. For composed functions, sensitivities compound by multiplication.

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}

Deep nesting, same rule

A network computes f(g(h(x))), layer after layer. The chain rule extends straight through: the derivative of the whole is the product of the layers' local derivatives, each evaluated at the input that layer actually receives. Each layer only needs to know how to differentiate itself; the chain rule stitches the local pieces into the global gradient. Depth is just a longer product.
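To make "depth is just a longer product" concrete, here is a minimal sketch in Python. The three layers (`h`, `g`, `f`) and their formulas are hypothetical stand-ins chosen for illustration, not taken from any real network. Each layer knows only its own derivative, and the product is compared against a finite-difference estimate.

```python
import math

# Three hypothetical "layers", each paired with its own local derivative.
def h(x):  return 2.0 * x + 1.0      # innermost layer
def dh(x): return 2.0

def g(v):  return math.sin(v)        # middle layer
def dg(v): return math.cos(v)

def f(u):  return u ** 2             # outermost layer
def df(u): return 2.0 * u

def composed(x):
    return f(g(h(x)))

def composed_derivative(x):
    # Forward pass: record the value each layer receives.
    v = h(x)
    u = g(v)
    # Chain rule: multiply the local derivatives, each evaluated
    # at the input its own layer saw during the forward pass.
    return df(u) * dg(v) * dh(x)

x0 = 0.3
analytic = composed_derivative(x0)

# Independent check: central finite difference on the whole stack.
eps = 1e-6
numeric = (composed(x0 + eps) - composed(x0 - eps)) / (2 * eps)

print(analytic, numeric)  # the two values should agree closely
```

Adding a fourth layer would add one more function, one more local derivative, and one more factor in the product. Nothing else changes.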

Backpropagation: the chain rule, reversed and reused

Computed naively, one parameter at a time, the chain rule repeats enormous work, because the products for different parameters share most of their factors. Backpropagation runs it backward, from the loss toward the inputs, caching those shared factors so each is computed once. The result: the gradient with respect to every parameter, at a cost that is a small constant multiple of a single forward pass. It's the same compute-once-and-reuse cleverness behind the FFT, applied to derivatives. This efficiency is what makes training billion-parameter models practical.
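The reuse can be seen in a very small case. The sketch below assumes a hypothetical scalar network with four parameters (`w1`, `b1`, `w2`, `b2`), a tanh activation, and a squared-error loss. All of these are illustrative choices, not a standard architecture. Notice that `d_yhat` and `d_a1` are each computed once and then shared by several parameter gradients.

```python
import math

def forward(params, x, target):
    """Forward pass; returns the loss and the values backward() will reuse."""
    w1, b1, w2, b2 = params
    a1 = w1 * x + b1
    h1 = math.tanh(a1)
    y_hat = w2 * h1 + b2
    loss = 0.5 * (y_hat - target) ** 2
    cache = (x, h1, y_hat)
    return loss, cache

def backward(params, cache, target):
    """Backward pass: walk from the loss toward the input, reusing factors."""
    w1, b1, w2, b2 = params
    x, h1, y_hat = cache
    d_yhat = y_hat - target          # dL/dy_hat, shared by everything below
    d_w2 = d_yhat * h1
    d_b2 = d_yhat
    d_h1 = d_yhat * w2               # pass the signal back one layer
    d_a1 = d_h1 * (1.0 - h1 ** 2)    # tanh'(a1) = 1 - tanh(a1)^2
    d_w1 = d_a1 * x                  # d_a1 is shared by w1 and b1
    d_b1 = d_a1
    return [d_w1, d_b1, d_w2, d_b2]

params = [0.5, -0.2, 1.5, 0.1]       # arbitrary illustrative values
x, target = 0.8, 1.0

loss, cache = forward(params, x, target)
grads = backward(params, cache, target)

def numeric_grad(i, eps=1e-6):
    """Naive alternative: two extra forward passes per parameter."""
    up = list(params)
    up[i] += eps
    down = list(params)
    down[i] -= eps
    return (forward(up, x, target)[0] - forward(down, x, target)[0]) / (2 * eps)

numeric = [numeric_grad(i) for i in range(len(params))]
print(grads)
print(numeric)  # should match grads closely
```

The finite-difference check needs two forward passes per parameter, so its cost grows with the parameter count. The backward pass produces all four gradients in a single sweep.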

See it move

Interactive: The function machine (input 1.5, rule "square it", output 2.25)
Every input has exactly one output. Plotting all (input, output) pairs draws the curve, a picture of the rule itself.

Composition made visible: an input feeds an inner rule, whose output feeds an outer rule. The chain rule says the overall sensitivity is the product of the sensitivities at each stage.

A worked example

Differentiate a composition

  1. A tiny network: y = (3x + 1)². Let u = 3x + 1, so y = u².

  2. Local rates: dy/du = 2u, and du/dx = 3.

  3. Multiply along the chain:

    \frac{dy}{dx} = 2u \cdot 3 = 6(3x+1)
  4. At x = 1: dy/dx = 6(4) = 24. Backprop does exactly this for every weight, layer by layer, reusing the shared pieces.
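The four steps above can be checked in a few lines of Python. This is a sketch for verification only; the function names `y` and `dy_dx` are made up for the example.

```python
def y(x):
    u = 3 * x + 1          # inner rule
    return u ** 2          # outer rule

def dy_dx(x):
    u = 3 * x + 1
    dy_du = 2 * u          # local rate of the outer rule
    du_dx = 3              # local rate of the inner rule
    return dy_du * du_dx   # multiply along the chain

slope = dy_dx(1)           # 2 * 4 * 3 = 24

# Independent check: nudge x slightly and measure how y responds.
eps = 1e-6
estimate = (y(1 + eps) - y(1 - eps)) / (2 * eps)

print(slope, estimate)     # estimate should be very close to 24
```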

Out in the world

Why GPUs train networks

Backpropagation reduces largely to big matrix multiplications: each layer's local derivatives, chained backward. GPUs are built to do exactly that kind of arithmetic in parallel. The chain rule, plus linear algebra, plus fast hardware, is the core engine of modern deep learning. Remove any one and training at today's scale stops.
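To show where the matrix multiplications come from, here is a sketch of the backward step for one linear layer y = Wx, written in plain Python lists so the arithmetic stays visible. The loss (half the sum of squared outputs) and the numbers in `W` and `x` are assumptions picked for illustration. A real framework would hand the same products to optimized GPU kernels.

```python
def matvec(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def outer(a, b):
    return [[ai * bj for bj in b] for ai in a]

def loss_fn(W, x):
    # Illustrative loss: L = 0.5 * sum(y_i^2), with y = W x.
    return 0.5 * sum(yi ** 2 for yi in matvec(W, x))

W = [[0.2, -0.5, 1.0],
     [1.5,  0.3, -0.7]]
x = [1.0, 2.0, -1.0]

# Forward pass through the layer.
y = matvec(W, x)

# Backward pass: the upstream gradient dL/dy equals y for this loss.
grad_y = list(y)
grad_W = outer(grad_y, x)                 # dL/dW = (dL/dy) x^T
grad_x = matvec(transpose(W), grad_y)     # dL/dx = W^T (dL/dy)

def numeric_grad_W(i, j, eps=1e-6):
    """Finite-difference check for a single weight."""
    up = [row[:] for row in W]
    up[i][j] += eps
    dn = [row[:] for row in W]
    dn[i][j] -= eps
    return (loss_fn(up, x) - loss_fn(dn, x)) / (2 * eps)

def numeric_grad_x(j, eps=1e-6):
    """Finite-difference check for a single input."""
    up = list(x)
    up[j] += eps
    dn = list(x)
    dn[j] -= eps
    return (loss_fn(W, up) - loss_fn(W, dn)) / (2 * eps)

print(grad_W[0][1], numeric_grad_W(0, 1))
print(grad_x[2], numeric_grad_x(2))
```

Both gradients are products with matrices: an outer product for the weights and a multiplication by the transposed weight matrix for the inputs. With a batch of inputs, these become full matrix-matrix multiplications.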

Common confusion, cleared

Backpropagation is a special new kind of math.

It's the chain rule from calculus, organized to avoid redundant work. The genius is in the bookkeeping (compute each shared factor once), not in any new derivative rule.

Deeper networks need fundamentally harder derivatives.

Each layer's derivative is simple; depth just means a longer product. The chain rule handles ten layers exactly as it handles two — more factors, same rule.

Check yourself

Practice: Quick check

  1. For y = (3x + 1)², dy/dx at x = 1 is…

  2. Backpropagation is best described as…

Recap

  • Chain rule: for composed functions, multiply the rates along the chain.
  • A deep network's gradient is the product of each layer's local derivative.
  • Backpropagation is the chain rule run backward with reuse — the engine of deep learning.
