LearnMathora

Optimization & gradient descent · 02 · Math for ML · 9 min

Medium

Partial derivatives & the gradient

A model has many knobs, and the loss depends on all of them. A partial derivative asks: if I nudge just this one knob, how does the loss respond? Bundle every partial derivative together and you get the gradient — the single most important vector in machine learning.

Build the intuition

Partial derivatives: one knob at a time

When a function depends on several variables, the partial derivative ∂L/∂w₁ is the ordinary derivative with respect to w₁ while every other variable is held frozen. It answers a clean question: holding everything else fixed, how fast does the loss change as I move this one parameter? Same calculus you know, applied one direction at a time.

$$\frac{\partial L}{\partial w_1}$$
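To make "one knob at a time" concrete, here is a minimal sketch that estimates each partial derivative with a central difference. The loss function, the helper names `partial_w1` and `partial_w2`, and the step `h` are illustrative assumptions, not part of the lesson's own material.

```python
def loss(w1, w2):
    # Illustrative two-parameter loss (an assumption for this sketch).
    return w1**2 + 3 * w2**2


def partial_w1(f, w1, w2, h=1e-5):
    # Nudge only w1 by +/- h while w2 stays frozen.
    return (f(w1 + h, w2) - f(w1 - h, w2)) / (2 * h)


def partial_w2(f, w1, w2, h=1e-5):
    # Nudge only w2 by +/- h while w1 stays frozen.
    return (f(w1, w2 + h) - f(w1, w2 - h)) / (2 * h)


dw1 = partial_w1(loss, 1.0, 1.0)
dw2 = partial_w2(loss, 1.0, 1.0)
print(dw1, dw2)  # close to 2 and 6 at the point (1, 1)
```

Each helper changes exactly one argument, which is all "holding everything else fixed" means in practice.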

The gradient: all the partials at once

Stack the partial derivatives into a vector — that's the gradient, ∇L. Its remarkable property: it points in the direction of steepest increase of the loss, and its negative points steepest downhill. So to reduce the loss fastest, step against the gradient. One vector tells a million parameters which way to move, together.

$$\nabla L = \left(\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \dots\right)$$
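The stacking idea can be sketched in a few lines: loop over the parameters, estimate one partial per parameter, and collect them in a list. The function `numerical_gradient`, the example loss, and the step size 0.01 are assumptions chosen for illustration; real training code would use exact gradients rather than finite differences.

```python
def numerical_gradient(f, w, h=1e-5):
    # One central-difference partial per parameter, stacked into a list.
    grad = []
    for i in range(len(w)):
        up = list(w)
        down = list(w)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def loss(w):
    # Illustrative loss (an assumption for this sketch).
    return w[0] ** 2 + 3 * w[1] ** 2


w = [1.0, 1.0]
grad = numerical_gradient(loss, w)

step = 0.01  # hypothetical small step size
downhill = [wi - step * gi for wi, gi in zip(w, grad)]  # against the gradient
uphill = [wi + step * gi for wi, gi in zip(w, grad)]    # along the gradient

print(loss(downhill) < loss(w) < loss(uphill))
```

Stepping against the gradient lowers the loss and stepping along it raises the loss, which is the property the paragraph above describes.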

Surfaces, level sets, and steepest descent

Picture the loss as a hilly surface. Level sets (contour lines) connect points of equal loss, like a topographic map. Wherever the gradient is nonzero, it is perpendicular to the contour through that point and points straight uphill, the steepest way up; its negative is the steepest way down. A ball released on the surface starts rolling along −∇L. That mental image scales, unchanged, from two dimensions to two million.
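The perpendicularity claim can be checked numerically. The sketch below assumes the illustrative loss L = w₁² + 3w₂², whose level sets are ellipses, walks around one contour, and takes the dot product of the gradient with the contour's tangent at each point. The parametrization and the helper names are assumptions made for this example.

```python
import math


def gradient(w1, w2):
    # Analytic gradient of the assumed loss L = w1^2 + 3*w2^2.
    return (2 * w1, 6 * w2)


def point_on_contour(c, t):
    # Point on the level set w1^2 + 3*w2^2 = c, parametrized by angle t.
    return (math.sqrt(c) * math.cos(t), math.sqrt(c / 3) * math.sin(t))


def tangent_on_contour(c, t):
    # Derivative of point_on_contour with respect to t: the contour's direction.
    return (-math.sqrt(c) * math.sin(t), math.sqrt(c / 3) * math.cos(t))


c = 4.0  # the contour where the loss equals 4
dots = []
for k in range(8):
    t = k * math.pi / 4
    w1, w2 = point_on_contour(c, t)
    g1, g2 = gradient(w1, w2)
    t1, t2 = tangent_on_contour(c, t)
    dots.append(g1 * t1 + g2 * t2)

print(max(abs(d) for d in dots))  # zero up to floating-point rounding
```

A zero dot product means the gradient has no component along the contour, so it crosses the contour at a right angle.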

See it move

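[Interactive: Optimization: the best box. Cut a square of side x from each corner of a 6×6 sheet and fold up the sides; the volume peaks at 16 when x = 1. Example reading: x = 0.35 gives volume ≈ 9.8 and slope ≈ 20.7, so the slope is positive (uphill) and a bigger cut still helps.]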

A one-parameter slice of a loss surface: the slope here is a single partial derivative. At the optimum it reads zero, just as every component of the full gradient does at an optimum. The gradient is this slope, computed along every parameter axis at once.
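The numbers in the box widget can be reproduced with a short sketch. It assumes the usual open-box setup, where a square of side x is cut from each corner of the 6×6 sheet, so the volume is V(x) = x(6 − 2x)². That formula is inferred from the widget's readings rather than stated in the lesson.

```python
def volume(x, side=6.0):
    # Open box: base (side - 2x) by (side - 2x), height x.
    return x * (side - 2 * x) ** 2


def slope(x, side=6.0):
    # dV/dx = (side - 2x)^2 - 4x(side - 2x) = (side - 2x)(side - 6x)
    return (side - 2 * x) * (side - 6 * x)


print(round(volume(0.35), 1), round(slope(0.35), 1))  # the widget's example reading
print(volume(1.0), slope(1.0))                        # the optimum: slope is zero
```

A positive slope at x = 0.35 says a bigger cut still increases the volume; at x = 1 the slope is zero and the volume stops improving.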

A worked example

Compute a gradient

  1. Loss L(w₁, w₂) = w₁² + 3w₂². Find the gradient.

  2. Partial in w₁ (freeze w₂): ∂L/∂w₁ = 2w₁. Partial in w₂: ∂L/∂w₂ = 6w₂.

  3. Stack them:

    $$\nabla L = (2w_1,\; 6w_2)$$
  4. At (1, 1): ∇L = (2, 6). The loss rises steepest in that direction, so a descent step moves in the opposite direction, (−2, −6). The step is larger along w₂, where the surface is steeper.
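To see what one descent step does to this loss, take a single step from (1, 1) with step size η = 0.1. The value of η is an assumption chosen only to keep the arithmetic simple.

$$\mathbf{w}_{\text{new}} = \mathbf{w} - \eta\,\nabla L = (1,\; 1) - 0.1\,(2,\; 6) = (0.8,\; 0.4)$$

$$L(1, 1) = 1^2 + 3(1)^2 = 4, \qquad L(0.8, 0.4) = 0.64 + 3(0.16) = 1.12$$

One step cuts the loss from 4 to 1.12, and w₂ moves three times as far as w₁ because its partial derivative is three times larger.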

Out in the world

Millions of partials, one backward pass

Training a neural network computes the gradient of the loss with respect to every weight — often billions of partial derivatives — in a single efficient sweep called backpropagation (next lesson). That gradient vector is then used to nudge every weight slightly downhill. No gradient, no learning.
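The "nudge every weight slightly downhill" step can be sketched as a loop that repeats w ← w − η∇L. This is a toy under stated assumptions: it reuses the two-parameter loss from the worked example with its hand-derived gradient, not backpropagation through a network, and the learning rate 0.1 and the 50 steps are arbitrary choices.

```python
def loss(w):
    # Two-parameter loss from the worked example.
    return w[0] ** 2 + 3 * w[1] ** 2


def gradient(w):
    # Hand-derived gradient (2*w1, 6*w2); a real network would get this from backprop.
    return [2 * w[0], 6 * w[1]]


def gradient_descent(w, learning_rate=0.1, steps=50):
    # Repeatedly move every parameter a small step against the gradient.
    history = [loss(w)]
    for _ in range(steps):
        g = gradient(w)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
        history.append(loss(w))
    return w, history


final_w, history = gradient_descent([1.0, 1.0])
print(history[0], history[1], history[-1])
```

The loss falls at every step and both parameters shrink toward the minimum at (0, 0). A learning rate that is too large would overshoot instead, which is why the steps are kept small.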

Common confusion, cleared

The gradient points toward the minimum.

It points toward the steepest increase: uphill. You follow its negative to descend. The gradient is also purely local: its negative is the steepest downhill direction at the current point, which need not aim straight at the global minimum.
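A quick check makes the gap visible. Using the worked example's loss L = w₁² + 3w₂², whose minimum is at (0, 0), the sketch below measures the angle between the descent direction at (1, 1) and the straight line to the minimum. The helper `angle_degrees` is a hypothetical name introduced for this example.

```python
import math

w = (1.0, 1.0)
descent = (-2 * w[0], -6 * w[1])       # negative gradient at w
to_minimum = (0.0 - w[0], 0.0 - w[1])  # straight line from w to the minimum (0, 0)


def angle_degrees(a, b):
    # Angle between two 2-D vectors, from the dot-product formula.
    dot = a[0] * b[0] + a[1] * b[1]
    return math.degrees(math.acos(dot / (math.hypot(*a) * math.hypot(*b))))


angle = angle_degrees(descent, to_minimum)
print(round(angle, 1))  # about 26.6 degrees off the direct route
```

The descent direction leans toward the steeper w₂ axis rather than aiming at the minimum, so gradient descent reaches the bottom through many small course corrections.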

Partial derivatives need new calculus.

They're ordinary derivatives with the other variables treated as constants. If you can differentiate one variable, you can take a partial — just hold the rest still.

Check yourself

Practice: Quick check

  1. To decrease the loss as fast as possible, step in the direction of…

  2. For L = w₁² + 3w₂², the partial ∂L/∂w₂ is…

Recap

  • A partial derivative varies one parameter, freezing the rest.
  • The gradient ∇L stacks all partials and points steepest uphill.
  • Models learn by stepping along −∇L, perpendicular to the loss contours.
