Norms & distance

Machine learning runs on two questions asked a billion times: how big is this, and how far apart are these? A norm measures a vector's length; a distance measures the gap between two vectors. Choosing the right one shapes how a model sees similarity, error, and size.

Build the intuition

A norm is a length

The familiar Euclidean norm, √(x₁² + x₂² + …), is Pythagoras in any dimension: the straight-line length of the arrow. It's written ‖x‖₂; the subscript 2 refers to the squares under the root. Doubling a vector doubles its norm, and the zero vector is the only vector with norm 0. The same idea of length carries over unchanged to 500 dimensions.

\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}
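As a minimal sketch of the formula above, assuming plain Python lists as vectors (the helper name `l2_norm` is illustrative, not from the lesson):

```python
import math

def l2_norm(x):
    """Euclidean length: square each entry, sum, take the square root."""
    return math.sqrt(sum(v * v for v in x))

x = [3.0, 4.0]
doubled = [2 * v for v in x]

print(l2_norm(x))           # 5.0
print(l2_norm(doubled))     # 10.0, doubling the vector doubles the norm
print(l2_norm([0.0, 0.0]))  # 0.0, only the zero vector has norm 0
```

The same function works for a vector of any length, which is the sense in which the norm is Pythagoras in any dimension.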

Different rulers for different jobs

The L1 norm ‖x‖₁ = |x₁| + |x₂| + … sums absolute values: 'city-block' distance, walking the grid. Its penalty grows at the same rate for every nonzero entry, however small, so L1 regularization keeps pushing small weights until they reach exactly zero, producing sparse models that ignore useless features. L2 regularization penalizes squares, so its pull weakens as a weight gets small: it shrinks weights smoothly but rarely zeroes them. Same vector, different notion of 'big', and the choice is a modeling decision with real consequences.

\|x\|_1 = \sum_i |x_i|
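To make the two rulers concrete, here is a small sketch comparing them on two made-up vectors (the vectors and helper names are assumptions chosen for illustration). Both have the same L2 length, but L1 rates the spread-out one as twice as big:

```python
import math

def l1_norm(x):
    """Sum of absolute values (city-block length)."""
    return sum(abs(v) for v in x)

def l2_norm(x):
    """Euclidean length."""
    return math.sqrt(sum(v * v for v in x))

spread = [1.0, 1.0, 1.0, 1.0]  # size shared across four entries
peaked = [2.0, 0.0, 0.0, 0.0]  # size concentrated in one entry

print(l2_norm(spread), l2_norm(peaked))  # 2.0 2.0
print(l1_norm(spread), l1_norm(peaked))  # 4.0 2.0
```

Under an L1 penalty the sparse vector is the cheaper one, which is one way to see why L1 favors solutions with many zeros.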

Distance is the norm of the difference

How far apart are two points? Take their difference vector and measure its norm: dist(a, b) = ‖a − b‖. Use the Euclidean norm for straight-line gaps and L1 for grid walks. Cosine 'distance' is the odd one out: it is not the norm of a difference but is commonly defined as one minus the normalized dot product (from the earlier lesson), so it compares direction only and ignores length. Recommendation engines, face recognition, and search all rank items by measures like these.

\text{dist}(a, b) = \|a - b\|
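The three measures mentioned above can be sketched side by side. This is an illustrative sketch, not a library API: the function names are hypothetical, and cosine distance is taken here as one minus cosine similarity, a common convention.

```python
import math

def euclidean(a, b):
    """L2 norm of the difference vector a - b."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def manhattan(a, b):
    """L1 norm of the difference vector a - b."""
    return sum(abs(p - q) for p, q in zip(a, b))

def cosine_distance(a, b):
    """1 - cos(angle between a and b). Undefined if either vector is zero."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / (norm_a * norm_b)

a = [2.0, 1.0]
b = [-1.0, 2.0]

print(manhattan(a, b))        # 4.0, since a - b = (3, -1)
print(euclidean(a, b))        # sqrt(10), about 3.162
print(cosine_distance(a, b))  # 1.0, the vectors are perpendicular
```

Note that scaling either vector changes the first two results but leaves the cosine distance unchanged, because it depends on direction only.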

See it move

Interactive: Vectors, arrows that add

Sliders: a = (2, 1), b = (−1, 2).

a + b = (2, 1) + (−1, 2) = (1, 3), tip to tail. Dot product 0: the two are exactly perpendicular, so they share no common direction.

A vector's norm is the length of its arrow; the distance between two is the length of the arrow connecting them. The dot-product readout previews cosine similarity — direction-only closeness.

A worked example

Measure an error two ways

A model predicted (3, 1) where the truth was (0, 5). The error vector is (3, −4).
L2 (Euclidean) error:
$\sqrt{3^2 + (-4)^2} = 5$
L1 (absolute) error:
$|3| + |-4| = 7$
Squaring makes L2 lean on the larger −4 component, which contributes 16 of the 25 under the root; L1 weighs both components linearly, so −4 contributes 4 of the 7. Which kind of 'wrong' you penalize is a design choice baked into the loss.
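The worked example can be reproduced in a few lines. This is a sketch using the lesson's numbers; the variable names are illustrative.

```python
import math

prediction = [3.0, 1.0]
truth = [0.0, 5.0]

# Error vector: prediction minus truth, component by component.
error = [p - t for p, t in zip(prediction, truth)]

l2_error = math.sqrt(sum(e * e for e in error))
l1_error = sum(abs(e) for e in error)

print(error)     # [3.0, -4.0]
print(l2_error)  # 5.0
print(l1_error)  # 7.0

# Share of the total contributed by the -4 component under each measure.
share_in_squares = error[1] ** 2 / sum(e * e for e in error)
share_in_abs = abs(error[1]) / l1_error
print(round(share_in_squares, 2), round(share_in_abs, 2))  # 0.64 0.57
```

The larger share under squaring is what "punishes big components heavily" means in practice.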

Out in the world

Why Lasso forgets features

L1-regularized regression (Lasso) adds a multiple of ‖weights‖₁ to the loss. Because the L1 ball has sharp corners on the axes, the optimum often lands exactly on them, driving some weights to exactly zero and automatically selecting which features matter. The geometry of a norm becomes feature selection.
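One way to see the zeroing numerically is to compare a single shrink step under each penalty. The sketch below is an added illustration, not the lesson's method: it assumes the soft-thresholding step used by proximal-gradient style Lasso solvers for L1, and a simple rescaling step for a squared-L2 penalty. The weights, the step size `t`, and the function names are all made up for the example.

```python
def soft_threshold(w, t):
    """L1 shrink step: move w toward 0 by t, stopping at exactly 0."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def l2_shrink(w, t):
    """Squared-L2 shrink step for penalty (t/2) * w**2: rescale by 1 / (1 + t)."""
    return w / (1.0 + t)

weights = [2.0, 0.3, -0.1, -1.5]  # hypothetical weights
t = 0.5                           # hypothetical penalty strength

l1_result = [soft_threshold(w, t) for w in weights]
l2_result = [l2_shrink(w, t) for w in weights]

print(l1_result)  # [1.5, 0.0, 0.0, -1.0]
print(l2_result)  # every entry smaller in magnitude, none exactly zero
```

The two small weights are dropped entirely by the L1 step, while the L2 step keeps all four features and only scales them down.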

Common confusion, cleared

“All distances are the straight-line (Euclidean) one.”

Euclidean is one choice among many. L1, cosine, and others measure 'far' differently, and in high dimensions the choice of metric can change which points count as neighbors, which can make or break a model.

“Bigger norm always means a worse model.”

Norm measures size, not quality. We penalize large weight norms to prevent overfitting, but the useful signal also has size. Regularization is a balance, not a crusade against magnitude.

Check yourself

Practice: Quick check

Which regularizer tends to drive weights to exactly zero (feature selection)?

Recap

A norm measures a vector's length; ‖x‖₂ is Euclidean (Pythagoras everywhere).
L1 sums absolute values and encourages sparsity; L2 shrinks smoothly.
Distance is the norm of the difference — the basis of similarity in ML.
