Loss functions & objectives
Before a model can learn, it needs to know what 'wrong' means. A loss function is that scorekeeper: one number measuring how far the model's predictions are from the truth. Learning is nothing more than making that number small.
Build the intuition
Loss turns 'wrong' into a number
Give the model an input, compare its prediction to the true answer, and score the gap. Squared error (prediction − truth)² is the classic: never negative, zero only when the prediction is exact, and it punishes big mistakes far more than small ones. Average the loss over all training examples and you get a single objective: the model's report card, condensed to one number to push downward.
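As a minimal sketch of that scoring step in Python (the function names and the three data points are illustrative assumptions, not part of the lesson), the per-example loss and the averaged objective might look like this:

```python
def squared_error(prediction, truth):
    """Loss for one example: the squared gap between prediction and truth."""
    return (prediction - truth) ** 2


def mean_squared_error(predictions, truths):
    """Objective for a dataset: the average of the per-example losses."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    losses = [squared_error(p, t) for p, t in zip(predictions, truths)]
    return sum(losses) / len(losses)


# Hypothetical data: the true value is always 10.
truths = [10.0, 10.0, 10.0]
predictions = [10.0, 11.0, 13.0]

# An error of 1 costs 1, but an error of 3 costs 9.
per_example = [squared_error(p, t) for p, t in zip(predictions, truths)]
print(per_example)  # [0.0, 1.0, 9.0]

# One number for the whole dataset (10 / 3 here).
print(mean_squared_error(predictions, truths))
```

Tripling the error multiplies the penalty by nine, which is what "punishes big mistakes far more than small ones" means in practice.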
Different losses, different values
Squared error suits regression; cross-entropy suits classification (it's the negative log-likelihood from statistics — fitting a model is maximizing likelihood). Absolute error (L1) tolerates outliers; squared error obsesses over them. The loss isn't a technicality — it encodes what you want the model to care about, and the wrong choice teaches the wrong lesson.
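One way to see the outlier claim is to ask which single constant prediction each loss prefers. The sketch below uses made-up values and a brute-force search over candidates (both are assumptions for illustration). Absolute error settles on the median, while squared error is dragged toward the outlier and settles on the mean.

```python
from statistics import mean, median


def mean_absolute_error(guess, values):
    """Average absolute gap between one constant guess and every value."""
    return sum(abs(v - guess) for v in values) / len(values)


def mean_squared_error(guess, values):
    """Average squared gap between one constant guess and every value."""
    return sum((v - guess) ** 2 for v in values) / len(values)


# Hypothetical data with one outlier (100).
values = [1, 2, 3, 4, 100]

# Try every constant guess from 0.0 to 100.0 in steps of 0.1.
candidates = [i / 10 for i in range(0, 1001)]

best_l1 = min(candidates, key=lambda c: mean_absolute_error(c, values))
best_l2 = min(candidates, key=lambda c: mean_squared_error(c, values))

print(best_l1, median(values))  # absolute error picks the median
print(best_l2, mean(values))    # squared error picks the mean
```

With these values the absolute-error winner stays among the typical points, while the squared-error winner moves far away from them to reduce the single large miss.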
The loss landscape
Fix the data and let the model's parameters vary: the loss becomes a surface over parameter space — valleys of good settings, hills of bad ones. Training is a search for the lowest valley. With two parameters you can picture a literal landscape; real models have millions of dimensions, but the intuition holds: learning is descending a loss surface.
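A two-parameter landscape can be sketched by evaluating the loss on a grid. The example below assumes a straight-line model `w * x + b`, four made-up data points, and a coarse grid (all illustrative choices). It then reads off the lowest point by exhaustive search rather than by descent.

```python
# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def loss(w, b):
    """Mean squared error of the line w * x + b on the fixed data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Grid over parameter space: slopes -1.0..5.0, intercepts -2.0..4.0, step 0.5.
slopes = [i * 0.5 for i in range(-2, 11)]
intercepts = [i * 0.5 for i in range(-4, 9)]

# The "surface": one loss value for every (slope, intercept) pair.
surface = {(w, b): loss(w, b) for w in slopes for b in intercepts}

# The lowest valley on this grid.
best = min(surface, key=surface.get)
print(best, surface[best])

# A hill for comparison: a flat line through the origin.
print(surface[(0.0, 0.0)])
```

Exhaustive search only works because there are two parameters and a small grid. With millions of parameters the grid is far too large to enumerate, which is why training moves downhill step by step instead.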
See it move
[Interactive figure: a line fitted to data points. The dashed residuals are the errors, and the sum of their squares is the loss. Moving the line to lower the loss is doing by hand what training does automatically.]
A worked example
Score two models
True values: [2, 4, 6]. Model A predicts [2, 4, 5]; Model B predicts [1, 4, 6].
Squared error A: 0 + 0 + 1 = 1. Squared error B: 1 + 0 + 0 = 1.
Equal loss — both are off by 1 on one point. The loss function declares them equally good, and training would be indifferent between them.
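In this example both errors have magnitude 1, so any loss that depends only on the size of each error still calls it a tie. Weight some points more heavily than others, though, and the verdict shifts. The loss defines 'best'.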
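The scoring above can be checked in a few lines of Python. This is a minimal sketch: the `total_squared_error` helper and the weights `[1, 1, 3]` are illustrative assumptions, chosen to show how weighting one point breaks the tie.

```python
truth = [2, 4, 6]
model_a = [2, 4, 5]  # off by 1 on the last point
model_b = [1, 4, 6]  # off by 1 on the first point


def total_squared_error(predictions, truths, weights=None):
    """Sum of squared errors, optionally weighting each point."""
    if weights is None:
        weights = [1] * len(truths)
    return sum(w * (p - t) ** 2 for w, p, t in zip(weights, predictions, truths))


# Unweighted: the two models tie.
loss_a = total_squared_error(model_a, truth)
loss_b = total_squared_error(model_b, truth)
print(loss_a, loss_b)  # 1 1

# Hypothetical weights: the last point matters three times as much.
weights = [1, 1, 3]
weighted_a = total_squared_error(model_a, truth, weights)
weighted_b = total_squared_error(model_b, truth, weights)
print(weighted_a, weighted_b)  # 3 1
```

Under these weights Model B is preferred, because Model A's only mistake falls on the point that now counts most.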
Out in the world
Why ChatGPT predicts the next word
Language models are trained with cross-entropy loss on next-word prediction: the objective rewards assigning high probability to the actual next token. That single loss, minimized over trillions of words, produces fluent text. The behavior emerges from the objective — choose the loss, and you choose what the model becomes.
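For a single prediction, cross-entropy reduces to the negative log of the probability the model gave to the token that actually came next. The sketch below uses a made-up three-word vocabulary and made-up probabilities (both assumptions for illustration); real models score tens of thousands of candidate tokens at every position.

```python
import math


def cross_entropy(predicted_probs, actual_token):
    """Negative log of the probability assigned to the true next token."""
    return -math.log(predicted_probs[actual_token])


# Hypothetical predictions for the word after "the cat sat on the".
confident = {"mat": 0.7, "sofa": 0.2, "moon": 0.1}
unsure = {"mat": 0.1, "sofa": 0.2, "moon": 0.7}

# Suppose the actual next word is "mat".
loss_confident = cross_entropy(confident, "mat")
loss_unsure = cross_entropy(unsure, "mat")

print(round(loss_confident, 3))  # 0.357
print(round(loss_unsure, 3))     # 2.303
```

The loss falls toward zero as the probability on the true token approaches 1, so pushing the average loss down is the same as pushing probability onto the words that actually occur.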
Common confusion, cleared
“The loss function is just a detail of the math.”
It defines the goal. A model minimizing the wrong loss optimizes the wrong thing perfectly: a recipe for confident, useless predictions. The loss is among the most consequential design choices in ML.
“Zero loss is always the goal.”
Zero training loss often means memorizing noise (overfitting). The real goal is low loss on unseen data — the bias–variance story from statistics. Perfect training scores can be a warning, not a triumph.
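A small sketch makes the warning concrete. It compares a lookup-table "memorizer" with a least-squares line on made-up data. The data, the nearest-point lookup, and the helper names are all assumptions for illustration.

```python
def mse(predictions, truths):
    """Mean squared error over a dataset."""
    return sum((p - t) ** 2 for p, t in zip(predictions, truths)) / len(truths)


# Hypothetical training data: y = 2x plus alternating noise of +0.5 / -0.5.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.5, 1.5, 4.5, 5.5, 8.5]

# Hypothetical unseen data from the same underlying rule, y = 2x.
test_x = [0.2, 1.2, 2.2, 3.2]
test_y = [0.4, 2.4, 4.4, 6.4]


def memorizer(x):
    """Return the stored y of the nearest training x (noise included)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def fit_line(xs, ys):
    """Closed-form least-squares slope and intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(train_x, train_y)


def line(x):
    """Prediction of the fitted straight line."""
    return slope * x + intercept


memorizer_train = mse([memorizer(x) for x in train_x], train_y)
memorizer_test = mse([memorizer(x) for x in test_x], test_y)
line_train = mse([line(x) for x in train_x], train_y)
line_test = mse([line(x) for x in test_x], test_y)

print("memorizer:", memorizer_train, memorizer_test)
print("line:     ", line_train, line_test)
```

On this data the memorizer reaches exactly zero training loss, yet it does worse than the line on the unseen points, because it has stored the noise along with the signal.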
Check yourself
Practice: quick check
Squared-error loss, compared to absolute-error loss, …
Recap
- A loss function scores how wrong predictions are, as one number.
- The loss encodes what the model values; choose it deliberately.
- Over parameter space, loss is a landscape — and training descends it.