LearnMathora

Optimization & gradient descent · 04 · Math for ML · 9 min

Hard

Gradient descent & learning rate

Armed with the gradient, which points in the direction of steepest increase, a model improves by repeatedly stepping the opposite way: downhill. That loop is gradient descent, the algorithm that trains nearly everything in machine learning. Its success hinges on one humble number: the step size.

Build the intuition

The update rule

Stand on the loss surface, compute the gradient (steepest uphill), and step the opposite way by a small amount. Repeat. In symbols: new parameters = old − (learning rate)·gradient. Provided the step is small enough, each update lowers the loss a little; thousands of updates walk the parameters into a valley. That four-symbol rule, looped, is how models learn.

$$w \leftarrow w - \eta\,\nabla L$$
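As a minimal sketch of the loop described above, the update can be written as a short Python function. The names `gradient_descent`, `grad`, `lr`, and `steps`, and the example loss L(w) = (w − 3)², are illustrative assumptions rather than part of the lesson.

```python
def gradient_descent(grad, w, lr, steps):
    """Apply w <- w - lr * grad(w) for a fixed number of steps."""
    for _ in range(steps):
        w = w - lr * grad(w)
    return w


# Hypothetical example loss: L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
def example_grad(w):
    return 2.0 * (w - 3.0)


w_final = gradient_descent(example_grad, w=0.0, lr=0.1, steps=100)
print(w_final)  # approaches the minimum at w = 3
```

The function takes the gradient as an argument, so the same loop works for any differentiable loss of one variable.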

The learning rate is everything

The learning rate η sets the step length. Too small: training crawls, wasting compute. Too large: steps overshoot the valley and bounce — or explode outward, the loss climbing instead of falling. There's a Goldilocks zone, and finding it is among the most common practical tasks in ML. Tuning η is half the art of training.
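The three regimes can be seen on a toy problem. The sketch below assumes the loss L(w) = w² (gradient 2w), a start at w = 4, and three hand-picked rates; the helper name `run` and the specific values are illustrative choices, not recommendations.

```python
def run(lr, steps=20, w=4.0):
    """Run gradient descent on L(w) = w**2 and return |w| after each step."""
    history = []
    for _ in range(steps):
        w = w - lr * 2.0 * w
        history.append(abs(w))
    return history


too_small = run(lr=0.001)   # crawls: |w| barely changes
reasonable = run(lr=0.3)    # converges quickly toward 0
too_large = run(lr=1.1)     # diverges: |w| grows every step

for name, hist in [("too small", too_small),
                   ("reasonable", reasonable),
                   ("too large", too_large)]:
    print(f"{name:>10}: |w| after 20 steps = {hist[-1]:.4g}")
```

On this particular loss, each step multiplies w by (1 − 2η), so the iterates shrink only when 0 < η < 1. Real losses have no such clean threshold, which is why the rate has to be tuned.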

Stochastic descent: estimate, don't compute

Real datasets are huge, so computing the exact gradient over all data each step is too slow. Stochastic gradient descent (SGD) uses a small random batch to estimate the gradient: noisier, but far cheaper and surprisingly effective, and the noise can even help escape shallow traps. Variants build on this: momentum smooths the updates with a running average of past gradients, and Adam additionally scales the step size for each parameter. Modern training is SGD with clever step-size machinery.
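A minimal sketch of the mini-batch idea, using only the Python standard library. The synthetic data (y = 2x plus noise), the one-parameter model y ≈ w·x, and the values of `lr` and `batch_size` are all assumptions made for illustration.

```python
import random

random.seed(0)

# Synthetic data (an assumption for illustration): y = 2x + small noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]


def batch_gradient(w, batch):
    """Gradient of the mean squared error (w*x - y)^2 over the given (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


w = 0.0
lr = 0.1
batch_size = 32
data = list(zip(xs, ys))

for step in range(500):
    batch = random.sample(data, batch_size)   # small random batch, not the full dataset
    w -= lr * batch_gradient(w, batch)        # noisy estimate of the full gradient

print(w)  # should land close to the true slope 2.0
```

Each update touches 32 points instead of 1000, so a step costs a fraction of a full-gradient step, at the price of a gradient estimate that jitters from batch to batch.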

See it move

[Interactive: gradient descent on a loss surface. Sample readout: after 8 steps at learning rate 0.3, loss 0.52, gradient -0.18. Still descending; add steps or raise the rate to reach the minimum.]

Tune the learning rate and watch descent converge, crawl, or diverge. Too big overshoots and explodes; too small barely moves; just right slides into the valley — the daily reality of training.

A worked example

Take a gradient-descent step

  1. Loss L(w) = w², so the gradient is L′(w) = 2w. Start at w = 4, learning rate η = 0.1.

  2. Step: w ← 4 − 0.1·(2·4) = 4 − 0.8 = 3.2.

  3. Again: w ← 3.2 − 0.1·(6.4) = 2.56. Each step shrinks w toward 0, the minimum.

  4. With η = 1.1 instead, w ← 4 − 1.1·8 = −4.8: the step overshoots past zero and lands farther from the minimum than it started (|−4.8| > 4). Too big a step diverges.
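The arithmetic in the steps above can be checked with a few lines of Python. The helper name `step` is an illustrative choice.

```python
def step(w, lr):
    """One gradient-descent step on L(w) = w**2, whose gradient is 2*w."""
    return w - lr * 2.0 * w


w1 = step(4.0, 0.1)     # 4 - 0.8 = 3.2
w2 = step(w1, 0.1)      # 3.2 - 0.64 = 2.56
w_big = step(4.0, 1.1)  # 4 - 8.8 = -4.8, farther from 0 than the start

print(w1, w2, w_big)
```

For this loss, one step is the same as multiplying w by (1 − 2η): the factor is 0.8 at η = 0.1, so w shrinks, and −1.2 at η = 1.1, so w flips sign and grows in magnitude on every step.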

Out in the world

Learning-rate schedules in practice

Training large models starts with a 'warmup' (small steps while things stabilize), then a high rate, then a decay toward the end for fine settling. Whole research papers optimize this schedule. The abstract step-size of this lesson is, in practice, a carefully engineered curve — because η really is that important.
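One way such a curve might look in code is sketched below: a linear warmup followed by a cosine decay. The cosine shape is one common choice swapped in for illustration, this version has no separate constant-rate plateau between the two phases, and the function name `learning_rate` and every default value (`peak_lr`, `warmup_steps`, `final_lr`) are hypothetical.

```python
import math


def learning_rate(step, total_steps, peak_lr=1e-3, warmup_steps=100, final_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to final_lr (all values hypothetical)."""
    if step < warmup_steps:
        # Warmup: ramp linearly from near 0 up to the peak rate.
        return peak_lr * (step + 1) / warmup_steps
    # Decay: fraction of the post-warmup phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(s, total_steps=1000) for s in range(1000)]
print(schedule[0], max(schedule), schedule[-1])
```

In a training loop, the value returned for the current step would replace the fixed η in the update rule.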

Common confusion, cleared

A bigger learning rate always trains faster.

Past a threshold, big steps overshoot and the loss diverges — training fails entirely. Faster-per-step is not faster-to-converge; there's a sweet spot, and beyond it lies chaos.

Gradient descent always finds the best answer.

On non-convex surfaces it can settle in a local minimum or stall near a saddle point. In practice (and with SGD's helpful noise) the solutions found are usually good enough, but 'descended to a minimum' isn't 'found the global best'.
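A small sketch makes the point concrete. The loss below, L(w) = w⁴ − 3w² + w, is a hypothetical non-convex example with two valleys of different depth; the starting points, rate, and step count are likewise illustrative.

```python
def loss(w):
    # Hypothetical non-convex loss with two valleys of different depth.
    return w**4 - 3.0 * w**2 + w


def grad(w):
    # Derivative of the loss above.
    return 4.0 * w**3 - 6.0 * w + 1.0


def descend(w, lr=0.01, steps=500):
    """Plain gradient descent from the starting point w."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w


w_left = descend(-2.0)   # settles in the deeper valley near w = -1.30
w_right = descend(2.0)   # settles in the shallower valley near w = 1.13

print(w_left, loss(w_left))
print(w_right, loss(w_right))
```

Both runs end where the gradient is essentially zero, yet only the run started on the left reaches the lower loss. Which valley plain gradient descent finds depends on where it starts.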

Check yourself

Practice: Quick check

  1. For L(w) = w² (gradient 2w), one step from w = 3 with η = 0.1 gives…

  2. Training loss suddenly explodes to huge values. The most likely culprit is…

Recap

  • Gradient descent: repeatedly step against the gradient by η.
  • The learning rate η must be tuned — too small crawls, too big diverges.
  • SGD estimates the gradient on batches; momentum smooths the updates and Adam adapts the step size.
