LearnMathora

Optimization & gradient descent · 01 · How models learn · 8 min

Medium

Loss functions & objectives

Before a model can learn, it needs to know what 'wrong' means. A loss function is that scorekeeper: one number measuring how far the model's predictions are from the truth. Learning is nothing more than making that number small.

Build the intuition

Loss turns 'wrong' into a number

Give the model an input, compare its prediction to the true answer, and score the gap. Squared error, (prediction − truth)², is the classic: never negative, zero only for a perfect prediction, and far harsher on big mistakes than on small ones. Average the loss over all training examples and you get a single objective: the model's report card, condensed to one number to push downward.

L = \frac{1}{n}\sum_i (\hat{y}_i - y_i)^2
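The averaged squared-error objective can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name mean_squared_error and the sample numbers are assumptions made for the example.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared gaps between predictions and true values."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_gaps = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_gaps) / len(squared_gaps)


# Illustrative numbers only (not taken from the lesson).
example_loss = mean_squared_error([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])

# A gap of 3 costs nine times as much as a gap of 1.
small_mistake = mean_squared_error([1.0], [0.0])
big_mistake = mean_squared_error([3.0], [0.0])
print(example_loss, small_mistake, big_mistake)
```

Squaring is what makes the big mistake dominate: tripling the gap multiplies its cost by nine.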

Different losses, different values

Squared error suits regression; cross-entropy suits classification (it is the negative log-likelihood from statistics, so minimizing it is the same as maximizing likelihood). Absolute error (L1) tolerates outliers; squared error obsesses over them. The loss isn't a technicality: it encodes what you want the model to care about, and the wrong choice teaches the wrong lesson.
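One way to see the difference in outlier tolerance is to ask which single constant each loss prefers as a summary of some data. The sketch below assumes made-up values with one outlier and uses hypothetical helper names (constant_mse, constant_mae); it compares the mean and the median as candidate predictions.

```python
from statistics import mean, median


def constant_mse(prediction, targets):
    """Mean squared error when one constant is predicted for every target."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


def constant_mae(prediction, targets):
    """Mean absolute error when one constant is predicted for every target."""
    return sum(abs(prediction - t) for t in targets) / len(targets)


# Made-up data: four ordinary values and one outlier.
values = [1, 2, 3, 4, 100]
center_mean = mean(values)      # dragged toward the outlier
center_median = median(values)  # stays with the bulk of the data

mse_at_mean = constant_mse(center_mean, values)
mse_at_median = constant_mse(center_median, values)
mae_at_mean = constant_mae(center_mean, values)
mae_at_median = constant_mae(center_median, values)

print("squared error prefers the mean:", mse_at_mean < mse_at_median)
print("absolute error prefers the median:", mae_at_median < mae_at_mean)
```

Squared error scores the mean better, even though the mean sits far from four of the five values; absolute error scores the median better. Same data, different loss, different "best" answer.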

The loss landscape

Fix the data and let the model's parameters vary: the loss becomes a surface over parameter space — valleys of good settings, hills of bad ones. Training is a search for the lowest valley. With two parameters you can picture a literal landscape; real models have millions of dimensions, but the intuition holds: learning is descending a loss surface.
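With two parameters the landscape can be tabulated directly. The sketch below assumes a tiny made-up dataset and a straight-line model with a slope and an intercept, then evaluates the loss at every point of a coarse grid. Scanning a grid is only a way to look at the surface; it is not how training searches it.

```python
def line_loss(slope, intercept, xs, ys):
    """Mean squared error of the line y = slope * x + intercept on fixed data."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# Coarse grid over parameter space.
slopes = [i * 0.5 for i in range(9)]             # 0.0, 0.5, ..., 4.0
intercepts = [-1.0 + j * 0.5 for j in range(9)]  # -1.0, -0.5, ..., 3.0

# The landscape: one loss value per (slope, intercept) pair.
landscape = {(m, b): line_loss(m, b, xs, ys) for m in slopes for b in intercepts}

# The lowest point on the grid is the bottom of the valley.
best_params = min(landscape, key=landscape.get)
print(best_params, landscape[best_params])
```

The data never changes inside the loop; only the parameters do. That is the sense in which the loss is a surface over parameter space.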

See it move

[Interactive: Fit the line yourself]

Loss made concrete: the dashed residuals are the errors, and their squared total is the loss. Drag the line to lower it — you're doing by hand what training does automatically.

A worked example

Score two models

  1. True values: [2, 4, 6]. Model A predicts [2, 4, 5]; Model B predicts [1, 4, 6].

  2. Squared error A: 0 + 0 + 1 = 1. Squared error B: 1 + 0 + 0 = 1.

  3. Equal loss — both are off by 1 on one point. The loss function declares them equally good, and training would be indifferent between them.

  4. Both errors here have size 1, so a loss that merely punishes large errors harder would still call it a tie. Weight some points more heavily than others, though, and the verdict shifts. The loss defines 'best'.
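The steps above can be checked in code. The true values and predictions come from the worked example; the per-point weights are an assumption added for illustration, standing in for a case where the third point matters three times as much as the others.

```python
def total_squared_error(predictions, targets, weights=None):
    """Sum of squared errors, optionally weighting each point."""
    if weights is None:
        weights = [1.0] * len(targets)
    return sum(w * (p - t) ** 2 for p, t, w in zip(predictions, targets, weights))


truth = [2, 4, 6]
model_a = [2, 4, 5]
model_b = [1, 4, 6]

# Plain squared error: the two models tie.
loss_a = total_squared_error(model_a, truth)
loss_b = total_squared_error(model_b, truth)

# Hypothetical weights: the third point counts three times as much.
weights = [1.0, 1.0, 3.0]
weighted_a = total_squared_error(model_a, truth, weights)
weighted_b = total_squared_error(model_b, truth, weights)

print(loss_a, loss_b)
print(weighted_a, weighted_b)
```

Under the weighted loss Model B comes out ahead, because Model A's only mistake falls on the point that now counts most. The predictions did not change; only the definition of 'wrong' did.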

Out in the world

Why ChatGPT predicts the next word

Language models are pretrained with cross-entropy loss on next-token prediction: the objective rewards assigning high probability to the actual next token. That single loss, minimized over trillions of tokens of text, produces fluent output. The behavior emerges from the objective: choose the loss, and you choose what the model becomes.
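For a single prediction, this loss is just the negative log of the probability the model gave to the token that actually came next. The sketch below uses invented probabilities over a three-word vocabulary; the function name next_token_loss, the vocabulary, and the numbers are all assumptions for illustration, not output from any real model.

```python
import math


def next_token_loss(predicted_probs, actual_token):
    """Cross-entropy for one prediction: -log of the probability
    assigned to the token that actually came next."""
    return -math.log(predicted_probs[actual_token])


# Hypothetical predictions for the word after "the cat sat on the".
good_guess = {"mat": 0.7, "sofa": 0.2, "moon": 0.1}
poor_guess = {"mat": 0.1, "sofa": 0.2, "moon": 0.7}

# Suppose the actual next word is "mat".
loss_good = next_token_loss(good_guess, "mat")
loss_poor = next_token_loss(poor_guess, "mat")

print(loss_good, loss_poor)
```

The more probability the model puts on the actual next token, the smaller the loss; a prediction that gave it probability 1 would score exactly zero. Training averages this quantity over every position in the training text.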

Common confusion, cleared

The loss function is just a detail of the math.

It defines the goal. A model minimizing the wrong loss optimizes the wrong thing perfectly, a recipe for confident, useless predictions. The loss is among the most consequential design choices in ML.

Zero loss is always the goal.

Zero training loss often means memorizing noise (overfitting). The real goal is low loss on unseen data — the bias–variance story from statistics. Perfect training scores can be a warning, not a triumph.
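A small sketch can make the gap between training loss and unseen-data loss concrete. Everything here is an assumption for illustration: the data are hand-written noisy samples of roughly y = 2x, the "memorizer" simply returns the stored answer of the nearest training point, and the simple model is a least-squares line computed in closed form.

```python
def mse(predict, xs, ys):
    """Mean squared error of a prediction function on a dataset."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hand-written noisy samples of roughly y = 2x (illustrative only).
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.5, 1.5, 4.6, 5.4, 8.5]
test_x = [0.4, 1.4, 2.4, 3.4]
test_y = [0.9, 2.6, 5.0, 6.7]


def memorizer(x):
    """Return the stored answer of the nearest training point, noise included."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


# Least-squares line fitted to the training data in closed form.
mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y)) / sum(
    (x - mean_x) ** 2 for x in train_x
)
intercept = mean_y - slope * mean_x


def fitted_line(x):
    return slope * x + intercept


memorizer_train = mse(memorizer, train_x, train_y)
memorizer_test = mse(memorizer, test_x, test_y)
line_train = mse(fitted_line, train_x, train_y)
line_test = mse(fitted_line, test_x, test_y)

print("memorizer:", memorizer_train, memorizer_test)
print("line:     ", line_train, line_test)
```

The memorizer reaches zero training loss and the line does not, yet on the unseen points the line scores far better. The training score alone would have picked the wrong model.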

Check yourself

Practice: Quick check

  1. Squared-error loss, compared to absolute-error loss, …

Recap

  • A loss function scores how wrong predictions are, as one number.
  • The loss encodes what the model values; choose it deliberately.
  • Over parameter space, loss is a landscape — and training descends it.
