Gradient descent & learning rate
Armed with the gradient — the downhill direction — a model improves by repeatedly stepping that way. That loop is gradient descent, the algorithm that trains nearly everything in machine learning. Its success hinges on one humble number: the step size.
Build the intuition
The update rule
Stand on the loss surface, compute the gradient (the direction of steepest ascent), and step the opposite way by a small amount. Repeat. In symbols: new parameters = old − (learning rate)·gradient. With a small enough step, each update lowers the loss a little; thousands of steps walk the parameters into a valley. That four-symbol rule, looped, is how models learn.
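The update rule can be sketched in a few lines of plain Python. This is a minimal illustration, not a training library: the function names (`gradient_step`, `descend`, `bowl_grad`) and the bowl-shaped loss L(x, y) = x² + 3y² are assumptions chosen for the example.

```python
def gradient_step(params, grads, learning_rate):
    """Return new parameters: old - learning_rate * gradient, element by element."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


def descend(grad_fn, params, learning_rate, num_steps):
    """Apply the update rule num_steps times, starting from params."""
    for _ in range(num_steps):
        params = gradient_step(params, grad_fn(params), learning_rate)
    return params


# Hypothetical bowl-shaped loss: L(x, y) = x**2 + 3*y**2, with gradient (2x, 6y).
def bowl_grad(params):
    x, y = params
    return [2 * x, 6 * y]


final = descend(bowl_grad, [4.0, -2.0], learning_rate=0.1, num_steps=100)
print(final)  # both coordinates end up very close to 0, the bottom of the bowl
```

The loop never looks at the loss value itself; it only needs the gradient, which is why the rule scales to models with millions of parameters.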
The learning rate is everything
The learning rate η sets the step length. Too small: training crawls, wasting compute. Too large: steps overshoot the valley and bounce — or explode outward, the loss climbing instead of falling. There's a Goldilocks zone, and finding it is among the most common practical tasks in ML. Tuning η is half the art of training.
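To make the three regimes concrete, the sketch below runs descent on the simple loss L(w) = w² with several learning rates and reports how each run ends. The function name `run_descent`, the thresholds, and the specific rates are illustrative assumptions, not recommended settings.

```python
def run_descent(learning_rate, start=4.0, max_steps=1000, tol=1e-6):
    """Descend on L(w) = w**2 and report how the run ended.

    Returns ("converged", steps), ("diverged", steps) or ("too slow", max_steps).
    """
    w = start
    for step in range(1, max_steps + 1):
        w = w - learning_rate * 2 * w  # gradient of w**2 is 2*w
        if abs(w) < tol:
            return ("converged", step)
        if abs(w) > 1e6:
            return ("diverged", step)
    return ("too slow", max_steps)


for lr in (0.0001, 0.1, 0.45, 1.1):
    print(lr, run_descent(lr))
```

On this loss the tiny rate runs out of steps, the two middle rates converge (the larger of them in fewer steps), and the rate of 1.1 blows up.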
Stochastic descent: estimate, don't compute
Real datasets are huge, so computing the exact gradient over all data each step is too slow. Stochastic gradient descent (SGD) uses a small random batch to estimate the gradient: noisier, but far cheaper and surprisingly effective, and the noise can even help escape shallow traps. Variants build on this: momentum smooths the direction by averaging recent gradients, and Adam also scales the step for each parameter automatically. Modern training is SGD with clever step-size machinery.
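A minimal sketch of the batch idea, using only the standard library. The toy dataset (points scattered around the line y = 3x), the one-parameter model y = w·x, and the names `batch_gradient` and `sgd` are assumptions made for illustration; momentum and Adam are not shown here.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical toy dataset: y = 3*x plus a little noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(10_000)]
data = [(x, 3.0 * x + random.gauss(0.0, 0.1)) for x in xs]


def batch_gradient(w, batch):
    """Gradient of the mean squared error of the model y = w*x on this batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def sgd(data, learning_rate=0.1, batch_size=32, num_steps=500):
    """Estimate the slope w using a fresh random batch at every step."""
    w = 0.0
    for _ in range(num_steps):
        batch = random.sample(data, batch_size)  # small random batch
        w -= learning_rate * batch_gradient(w, batch)  # noisy but cheap estimate
    return w


w_hat = sgd(data)
print(w_hat)  # should land near the true slope of 3
```

Each step touches 32 points instead of all 10,000, so it costs a small fraction of an exact gradient while still pulling w toward the right value on average.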
See it move
Tune the learning rate and watch descent converge, crawl, or diverge. Too big overshoots and explodes; too small barely moves; just right slides into the valley — the daily reality of training.
A worked example
Take a gradient-descent step
Loss L(w) = w², so the gradient is L′(w) = 2w. Start at w = 4, learning rate η = 0.1.
Step: w ← 4 − 0.1·(2·4) = 4 − 0.8 = 3.2.
Again: w ← 3.2 − 0.1·(6.4) = 2.56. Each step shrinks w toward 0, the minimum.
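With η = 1.1 instead, w ← 4 − 1.1·8 = 4 − 8.8 = −4.8: the step overshoots past zero and lands farther from the minimum than it started (|−4.8| > 4). Every further step repeats this, so too big a step diverges.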
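The arithmetic above can be checked with a short script. The helper name `trace` is an assumption for this sketch; the loss, starting point, and learning rates are the ones from the worked example.

```python
def trace(learning_rate, w=4.0, num_steps=5):
    """Return the sequence of w values for descent on L(w) = w**2."""
    history = [w]
    for _ in range(num_steps):
        w = w - learning_rate * (2 * w)  # L'(w) = 2w
        history.append(w)
    return history


print([round(w, 4) for w in trace(0.1)])  # starts 4.0, 3.2, 2.56, ...
print([round(w, 4) for w in trace(1.1)])  # starts 4.0, -4.8, 5.76, ...
```

For this particular loss each step multiplies w by (1 − 2η): that factor is 0.8 when η = 0.1, so w shrinks, and −1.2 when η = 1.1, so w flips sign and grows.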
Out in the world
Learning-rate schedules in practice
Training large models starts with a 'warmup' (small steps while things stabilize), then a high rate, then a decay toward the end for fine settling. Whole research papers optimize this schedule. The abstract step-size of this lesson is, in practice, a carefully engineered curve — because η really is that important.
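One simple way to picture such a schedule is a function from step number to learning rate. The sketch below assumes linear ramps and made-up values (a peak of 1e-3, a floor of 1e-5, 100 warmup steps, 300 decay steps); real schedules vary widely, and the name `learning_rate_at` is hypothetical.

```python
def learning_rate_at(step, total_steps, warmup_steps, decay_steps,
                     peak=1e-3, floor=1e-5):
    """Learning rate for a 0-based step: linear warmup, flat peak, linear decay."""
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps  # ramp up from small steps
    if step < decay_start:
        return peak  # hold the high rate
    progress = (step - decay_start + 1) / decay_steps  # reaches 1.0 at the last step
    return peak + (floor - peak) * progress  # ramp down toward the floor


schedule = [learning_rate_at(s, total_steps=1000, warmup_steps=100, decay_steps=300)
            for s in range(1000)]
print(schedule[0], schedule[99], schedule[500], schedule[999])
```

Inside a training loop, the value for the current step would simply replace the fixed η in the update rule.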
Common confusion, cleared
“A bigger learning rate always trains faster.”
Past a threshold, big steps overshoot and the loss diverges — training fails entirely. Faster-per-step is not faster-to-converge; there's a sweet spot, and beyond it lies chaos.
“Gradient descent always finds the best answer.”
On non-convex surfaces it can settle in a local minimum or stall near a saddle point. In practice (and with SGD's helpful noise) the solutions found are usually good enough, but 'descended to a minimum' isn't 'found the global best'.
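A small sketch shows how the starting point decides the outcome. The loss L(w) = w⁴ − 3w² + w is a hypothetical example picked because it has two valleys of different depth; the names `loss`, `grad`, and `descend_from` are likewise assumptions.

```python
def loss(w):
    """Hypothetical non-convex loss with two valleys of different depth."""
    return w**4 - 3 * w**2 + w


def grad(w):
    """Derivative of the loss above: 4w^3 - 6w + 1."""
    return 4 * w**3 - 6 * w + 1


def descend_from(w, learning_rate=0.01, num_steps=2000):
    """Plain gradient descent from the given starting point."""
    for _ in range(num_steps):
        w -= learning_rate * grad(w)
    return w


left = descend_from(-2.0)   # settles in the deeper valley, near w = -1.30
right = descend_from(2.0)   # settles in the shallower valley, near w = 1.13
print(left, loss(left))
print(right, loss(right))
```

Both runs end where the gradient is essentially zero, yet the run started at w = 2 finishes with a noticeably higher loss than the run started at w = −2.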
Check yourself
Quick check
For L(w) = w² (gradient 2w), one step from w = 3 with η = 0.1 gives…
Training loss suddenly explodes to huge values. The most likely culprit is…
Recap
- Gradient descent: repeatedly step against the gradient by η.
- The learning rate η must be tuned — too small crawls, too big diverges.
- SGD estimates the gradient on batches; momentum smooths the steps and Adam adapts their size.