Statistics for machine learning

Machine learning is, at heart, applied statistics at scale. Every model is an estimate, every prediction carries uncertainty, and the central challenge — generalizing from a sample to the world — is the sampling problem from lesson two, supercharged. This capstone connects statistics to the field it now powers.

Build the intuition

Generalization is the sampling problem

Training data is a sample; the real world is the population. A model that performs well on training but poorly in deployment has learned the sample's quirks, not the population's structure — the representativeness problem from lesson two, with parameters. Everything in ML evaluation exists to estimate population performance from sample performance honestly.

Overfitting and the bias–variance trade-off

A too-simple model misses real structure (high bias, underfitting); a too-complex one memorizes noise (high variance, overfitting). For squared-error loss, the expected prediction error decomposes into bias², variance, and irreducible noise, and reducing bias often raises variance, or vice versa. The art is finding the sweet spot: complex enough to capture signal, simple enough to ignore noise. This trade-off is statistics' fingerprint on every model-selection decision.

\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}
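To see the decomposition in numbers, here is a minimal simulation sketch. It assumes a made-up ground truth (a sine curve), Gaussian noise, and polynomial models fitted with NumPy; none of these come from the lesson. It refits each model on many fresh samples and measures, at one input, how far the average prediction sits from the truth (bias²) and how much the predictions scatter from sample to sample (variance).

```python
import numpy as np

NOISE_SD = 0.3  # assumed noise level, chosen only for illustration

def true_fn(x):
    # Hypothetical ground truth; the lesson does not specify one.
    return np.sin(2 * np.pi * x)

def bias_variance_at(x0, degree, n_train=30, n_repeats=500, seed=0):
    """Estimate bias^2 and variance of a polynomial fit's prediction at x0."""
    rng = np.random.default_rng(seed)
    preds = np.empty(n_repeats)
    for i in range(n_repeats):
        # Each repeat is a fresh training sample from the same population.
        x = rng.uniform(0.0, 1.0, n_train)
        y = true_fn(x) + rng.normal(0.0, NOISE_SD, n_train)
        coeffs = np.polyfit(x, y, degree)
        preds[i] = np.polyval(coeffs, x0)
    bias_sq = float((preds.mean() - true_fn(x0)) ** 2)
    variance = float(preds.var())
    return bias_sq, variance

for degree in (1, 3, 9):
    b2, var = bias_variance_at(x0=0.25, degree=degree)
    print(f"degree {degree}: bias^2={b2:.4f}  variance={var:.4f}  noise={NOISE_SD**2:.4f}")
```

Under these assumptions the straight line should show by far the largest bias², while variance grows from the cubic to the degree-9 fit. Exact numbers depend on the seed and the assumed curve.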

Honest evaluation: train, validate, test

Measuring a model on its training data is like grading students on questions they've already seen: the score is optimistic to the point of being meaningless. So we hold out data: train on one slice, tune on a validation slice, and estimate real-world performance on an untouched test slice. Cross-validation reuses data efficiently for the same goal by rotating which slice is held out, so every point is used for both fitting and checking. It's all sampling discipline: keep the test set a fair, unseen sample of the population.
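The slicing itself is only index bookkeeping. The sketch below shows one way to do it with NumPy; the function names, the 60/20/20 proportions, and the choice of five folds are illustrative assumptions, not something the lesson prescribes.

```python
import numpy as np

def train_val_test_split(n, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle indices 0..n-1 and cut them into three disjoint slices."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = idx[:n_test]                      # set aside, untouched until the end
    val = idx[n_test:n_test + n_val]         # used for tuning
    train = idx[n_test + n_val:]             # used for fitting
    return train, val, test

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, held_out_idx) pairs; each point is held out exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        held_out = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, held_out

train, val, test = train_val_test_split(100)
print(len(train), len(val), len(test))  # prints: 60 20 20
```

Shuffling before cutting is what makes each slice a fair sample of the whole; data with time order or grouped records would need a different scheme.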

The convergence of the whole platform

Modern ML braids together every thread of this curriculum: probability (distributions, likelihood) defines the loss; calculus (gradients, optimization) minimizes it; linear algebra (vectors, matrices, projections, eigenvectors) is the computation; and statistics (estimation, uncertainty, generalization) keeps it honest. A typical trained model is a maximum likelihood estimate (MLE) found by gradient descent over a pipeline of matrix operations and evaluated with sampling theory. That convergence is why these subjects belong on one platform.
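The braid can be seen in a few lines. The sketch below assumes the simplest case, a straight-line model with Gaussian noise on made-up data, where maximizing the likelihood is the same as minimizing squared error. The data, learning rate, and step count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 3 + 0.5 x + Gaussian noise.
n = 200
x = rng.uniform(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])   # linear algebra: the design matrix
y = 3.0 + 0.5 * x + rng.normal(0.0, 1.0, n)

def neg_log_likelihood(w):
    # Probability: with Gaussian noise of fixed variance, the negative
    # log-likelihood equals half the mean squared error, up to constants.
    residuals = X @ w - y
    return 0.5 * np.mean(residuals ** 2)

def gradient(w):
    # Calculus: the gradient of the loss above with respect to w.
    return X.T @ (X @ w - y) / n

w = np.zeros(2)
learning_rate = 0.02   # small enough for this data; an assumed value
for _ in range(20000):
    w -= learning_rate * gradient(w)

# Check against the direct least-squares solution.
w_closed_form, *_ = np.linalg.lstsq(X, y, rcond=None)
print("gradient descent:", w)
print("least squares:   ", w_closed_form)
```

Both routes should land on essentially the same intercept and slope. Statistics then asks the remaining question: how well does that estimate hold up on data the model has not seen?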

See it move

Interactive: Fit the line yourself

Adjust the slope (initially 0.5) and intercept (initially 3), then reveal the least-squares line. At the initial settings the total squared error is 11.9. The dashed residuals are your errors; regression finds the line that makes the total of their squares as small as possible.
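The same exercise can be sketched in code. The points below are invented for illustration, since the interactive's actual data is not given in the text, so the error values will not match the 11.9 shown there.

```python
import numpy as np

# Hypothetical points; not the interactive's data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.9, 3.6, 5.2, 4.8, 6.1, 5.7])

def total_squared_error(slope, intercept):
    # Residual = observed y minus the line's prediction.
    residuals = y - (slope * x + intercept)
    return float(np.sum(residuals ** 2))

# A hand-picked guess, like dragging the sliders.
guess_error = total_squared_error(slope=0.5, intercept=3.0)

# Closed-form least-squares estimates for a single predictor.
slope_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept_hat = y.mean() - slope_hat * x.mean()
best_error = total_squared_error(slope_hat, intercept_hat)

print(f"guess:         error={guess_error:.3f}")
print(f"least squares: slope={slope_hat:.3f}, intercept={intercept_hat:.3f}, error={best_error:.3f}")
```

No other slope and intercept can give a smaller total than the least-squares pair, which is what the reveal button demonstrates.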

Overfitting in miniature: a wiggly line can chase every point to zero training error, but the straight least-squares fit generalizes. Simpler often predicts better.
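Here is a small sketch of that claim, assuming a population that really is a straight line plus noise (an invented setup). With eight training points, a degree-7 polynomial can pass through every one of them; the simulation repeats the experiment many times and compares average errors on the training points and on fresh points.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 10.0, 8)
x_test = np.linspace(0.25, 9.75, 50)   # new inputs from the same range

def draw_y(x):
    # Hypothetical population: y = 3 + 0.5 x + noise.
    return 3.0 + 0.5 * x + rng.normal(0.0, 1.0, x.size)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {"line": {"train": [], "test": []},
           "wiggly": {"train": [], "test": []}}
for _ in range(300):
    y_train, y_test = draw_y(x_train), draw_y(x_test)
    fits = {
        "line": np.polyfit(x_train, y_train, 1),     # straight least-squares fit
        "wiggly": np.polyfit(x_train, y_train, 7),   # passes through all 8 points
    }
    for name, coeffs in fits.items():
        results[name]["train"].append(mse(coeffs, x_train, y_train))
        results[name]["test"].append(mse(coeffs, x_test, y_test))

for name in ("line", "wiggly"):
    avg_train = np.mean(results[name]["train"])
    avg_test = np.mean(results[name]["test"])
    print(f"{name:>6}: train MSE={avg_train:.3f}  test MSE={avg_test:.3f}")
```

The wiggly fit should report essentially zero training error and a clearly worse test error than the line, because it has reproduced the noise in each training sample.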

A worked example

Diagnose a failing model

A model scores 99% on training data, 71% on test data. What's wrong?
The train–test gap screams overfitting: the model memorized training noise (high variance) rather than learning generalizable structure.
Fixes are statistical: more data (shrinks variance), a simpler model (less capacity to memorize), or regularization (penalize complexity).
Without the held-out test sample, you'd have shipped a '99% accurate' model that fails in the world — the sampling lesson, with consequences.
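Of the fixes listed, regularization is the easiest to sketch. The example below uses ridge regression (a squared penalty on the coefficients) as one common form of it; the sine-curve population, the degree-9 features, and the penalty strength 0.01 are all assumptions made for illustration, and the intercept is penalized too for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
DEGREE = 9

def features(x):
    # Polynomial features 1, x, ..., x^9: deliberately more capacity than needed.
    return np.vander(x, DEGREE + 1, increasing=True)

def fit_ridge(X, y, lam):
    # Minimizes ||Xw - y||^2 + lam * ||w||^2; lam = 0 is plain least squares.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def sample(n):
    # Hypothetical population: a smooth curve plus noise.
    x = rng.uniform(-1.0, 1.0, n)
    return x, np.sin(np.pi * x) + rng.normal(0.0, 0.3, n)

def mse(w, x, y):
    return float(np.mean((features(x) @ w - y) ** 2))

def average_errors(lam, n_train=15, n_trials=200):
    """Average train and test MSE over many small training samples."""
    train_err, test_err = [], []
    for _ in range(n_trials):
        x_tr, y_tr = sample(n_train)
        x_te, y_te = sample(500)
        w = fit_ridge(features(x_tr), y_tr, lam)
        train_err.append(mse(w, x_tr, y_tr))
        test_err.append(mse(w, x_te, y_te))
    return float(np.mean(train_err)), float(np.mean(test_err))

for lam in (0.0, 0.01):
    train, test = average_errors(lam)
    print(f"lambda={lam}: train MSE={train:.3f}, test MSE={test:.3f}")
```

With no penalty the train and test errors should be far apart, as in the 99% versus 71% case. The small penalty should raise training error somewhat and pull test error down substantially. In practice the penalty strength would be tuned on a validation slice, not fixed in advance.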

Out in the world

Why models need uncertainty, not just answers

A self-driving car whose detector says '70% pedestrian' must act differently from one whose detector says '99%', and a medical AI faces the same choice with a diagnosis. Calibrated uncertainty, a statistical property, is what makes predictions safe to act on: among predictions made with 70% confidence, about 70% should turn out correct. The frontier of trustworthy ML is statistics: knowing not just what the model predicts, but how much to believe it.
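Calibration can be checked by comparing stated confidence with observed accuracy. The sketch below computes one common summary, often called expected calibration error, on simulated predictions. The two simulated models, the bin count, and the function name are assumptions for illustration.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to an equal-width confidence bin on [0, 1].
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # weight by the share of predictions in the bin
    return float(ece)

rng = np.random.default_rng(0)
confidences = rng.uniform(0.5, 1.0, 20000)

# Simulated calibrated model: a prediction made with confidence p is right with probability p.
calibrated = rng.random(20000) < confidences
# Simulated overconfident model: right 15 percentage points less often than it claims.
overconfident = rng.random(20000) < (confidences - 0.15)

print("calibrated ECE:   ", round(expected_calibration_error(confidences, calibrated), 3))
print("overconfident ECE:", round(expected_calibration_error(confidences, overconfident), 3))
```

A value near zero means the stated percentages can be taken roughly at face value; the overconfident model's value should sit near the 0.15 gap built into the simulation.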

Common confusion, cleared

“More model complexity always means better predictions.”

Past the sweet spot, complexity fits noise and generalization worsens — the variance side of the trade-off. The best model is as simple as the data's signal allows, no simpler.

“High test accuracy means the model is trustworthy everywhere.”

Only on data resembling the test sample. Deploy on a different population (distribution shift) and performance can collapse; the sampling problem never sleeps. Calibrated uncertainty is a safeguard, provided calibration is itself rechecked on the new population.
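A toy sketch of distribution shift, under invented assumptions: the true relationship is a curve, the model is a straight line fitted on inputs between 0 and 2, and deployment later sees inputs between 4 and 6.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(low, high, n):
    # Hypothetical population: y = x^2 plus noise, observed on [low, high].
    x = rng.uniform(low, high, n)
    return x, x ** 2 + rng.normal(0.0, 0.2, n)

# Fit a straight line on the training range only.
x_train, y_train = sample(0.0, 2.0, 200)
coeffs = np.polyfit(x_train, y_train, 1)

def mse(x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

in_dist = mse(*sample(0.0, 2.0, 1000))   # test data resembling the training sample
shifted = mse(*sample(4.0, 6.0, 1000))   # a different population of inputs

print(f"test MSE, same range:    {in_dist:.3f}")
print(f"test MSE, shifted range: {shifted:.3f}")
```

The held-out score on the original range says nothing about the shifted range, where the line's error should be far larger.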

Check yourself

Practice: Quick check

Training accuracy 98%, test accuracy 72%. The diagnosis is…
Why hold out a test set never used in training?

Recap

Generalization is sampling: training is the sample, the world is the population.
Bias–variance trade-off: balance capturing signal against memorizing noise.
Held-out evaluation keeps estimates honest; ML braids probability, calculus, linear algebra, and statistics.
