Data, populations & samples

Before any formula: what is your data, and what does it represent? Many statistical errors trace back to confusing the sample you measured with the population you care about. Get this distinction right and a large part of statistics becomes common sense.

Build the intuition

Population vs sample

The population is everyone (or everything) you want to conclude about — all voters, all possible users, all manufactured bolts. The sample is the slice you actually measured. Statistics is the art of saying something trustworthy about the population from only the sample, with the gap between them honestly accounted for.

Types of data shape your tools

Numerical data (height, price, temperature) supports means and spreads. Categorical data (color, yes/no, country) supports counts and proportions. Ordinal data (ratings, ranks) sits between: the categories have an order, so medians and rank-based methods apply, but the gaps between levels are not guaranteed to be equal. Picking the right summary and test starts with correctly naming what kind of data you hold.
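A minimal sketch of matching the summary to the data type, using only Python's standard library. The three small datasets are invented for illustration and are not from any real study.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical toy data, invented for illustration only.
heights_cm = [158.0, 162.5, 171.0, 175.5, 183.0]   # numerical
colors = ["red", "blue", "red", "green", "red"]     # categorical
ratings = [1, 3, 4, 4, 5]                           # ordinal (1 to 5 stars)

# Numerical: means and spreads are meaningful.
height_mean = mean(heights_cm)
height_sd = stdev(heights_cm)

# Categorical: counts and proportions are meaningful; a "mean color" is not.
color_counts = Counter(colors)
red_share = color_counts["red"] / len(colors)

# Ordinal: the order is meaningful, so the median is a safe summary.
rating_median = median(ratings)

print(f"heights: mean={height_mean:.1f} cm, sd={height_sd:.1f} cm")
print(f"colors:  {dict(color_counts)}, share red={red_share:.0%}")
print(f"ratings: median={rating_median}")
```

Averaging star ratings is common in practice, but it quietly assumes the step from 1 to 2 stars is the same size as the step from 4 to 5.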

Sampling bias: the silent killer

A sample only speaks for the population if it's representative. Survey gym members about exercise and you'll badly overestimate how much the nation exercises. The 1936 Literary Digest poll collected about 2.4 million responses and still called the election wrong: its mailing list leaned on telephone directories and car registrations during the Depression, which skewed toward wealthier households. Bias beats sample size: a biased mountain of data lies louder than a fair handful.
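The gym example can be sketched as a small simulation. Everything here is an assumption made for illustration: the population size, the 20% membership rate, and the exercise-hour distributions are invented, not measured.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 20% are gym members, who exercise more per week.
population = []
for _ in range(100_000):
    is_member = rng.random() < 0.20
    hours = rng.gauss(6.0, 1.5) if is_member else rng.gauss(2.0, 1.0)
    population.append((is_member, max(hours, 0.0)))  # hours cannot be negative

true_mean = mean(hours for _, hours in population)

# Huge but biased sample: 10,000 people, all surveyed at the gym.
gym_hours = [hours for is_member, hours in population if is_member]
biased_big = rng.sample(gym_hours, 10_000)

# Small but fair sample: 100 people drawn at random from everyone.
fair_small = rng.sample([hours for _, hours in population], 100)

print(f"true mean:             {true_mean:.2f} h/week")
print(f"biased, n=10,000:      {mean(biased_big):.2f} h/week")
print(f"fair,   n=100:         {mean(fair_small):.2f} h/week")
```

Under these assumptions the 100-person random sample should land much closer to the truth than the 10,000-person gym survey, because extra gym respondents only sharpen the estimate of the wrong quantity.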

See it move

Interactive: The law of large numbers

Flips: 40

After 40 flips: 23 heads = 57.5%. Any single flip is pure chance, but for a fair coin the running share of heads is drawn toward 50% as the flips accumulate. Randomness has long-run structure.

Each flip is a sample from the 'population' of all possible flips. Watch how the sample average approaches the true population value — but only because the sampling is fair.
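For readers without the interactive, the same experiment can be approximated in a few lines. This is a sketch, not the page's own widget: the function name `running_share_of_heads` and the seed are hypothetical choices, and the coin is assumed to be fair.

```python
import random

def running_share_of_heads(n_flips, seed=0):
    """Flip a simulated fair coin n_flips times; return the share of heads after each flip."""
    rng = random.Random(seed)
    heads = 0
    shares = []
    for flips_so_far in range(1, n_flips + 1):
        if rng.random() < 0.5:  # fair coin: heads with probability 0.5
            heads += 1
        shares.append(heads / flips_so_far)
    return shares

shares = running_share_of_heads(100_000)
for n in (10, 40, 1_000, 100_000):
    print(f"after {n:>7,} flips: {shares[n - 1]:.1%} heads")
```

Early values can sit well away from 50%, as the 57.5% after 40 flips does, while later values settle close to it. Changing the seed changes the early wobble but not the long-run destination.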

A worked example

Spot the sampling flaw

An app shows a 4.8-star average. Trustworthy measure of user satisfaction?
Who leaves reviews? Mostly the delighted and the furious — the quiet middle is missing. The sample is self-selected.
The 4.8 describes reviewers, not users. To learn about all users, you'd need to sample them directly — a prompt to a random subset, say.
The number was real; the population it represented was the trap.
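The self-selection effect can be illustrated with a simulation. All numbers below are assumptions chosen for the sketch: the satisfaction distribution, the chance that a user at each satisfaction level leaves a review, and the idea that every randomly prompted user answers.

```python
import random
from statistics import mean

rng = random.Random(3)

STARS = [1, 2, 3, 4, 5]
# Hypothetical true satisfaction across all users: mostly a quiet middle.
SATISFACTION_WEIGHTS = [0.05, 0.10, 0.35, 0.35, 0.15]
# Hypothetical chance of leaving a review: the furious and the delighted speak up.
REVIEW_PROBABILITY = {1: 0.10, 2: 0.01, 3: 0.01, 4: 0.02, 5: 0.30}

users = rng.choices(STARS, weights=SATISFACTION_WEIGHTS, k=50_000)

# Self-selected sample: whoever chose to leave a review.
reviews = [stars for stars in users if rng.random() < REVIEW_PROBABILITY[stars]]

# Random prompt: 500 users picked by us, assuming all of them respond.
prompted = rng.sample(users, 500)

all_users_mean = mean(users)
reviewers_mean = mean(reviews)
prompted_mean = mean(prompted)

print(f"all users (unknown in real life): {all_users_mean:.2f}")
print(f"store rating from {len(reviews):,} reviews:   {reviewers_mean:.2f}")
print(f"random prompt to 500 users:       {prompted_mean:.2f}")
```

In this setup the store rating is built from several thousand reviews and still overstates satisfaction, while 500 randomly prompted users track the true average. In practice a prompt also suffers nonresponse, so the random subset is a starting point rather than a guarantee.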

Out in the world

Why ML models fail in the wild

A model trained on data from one hospital often fails at another: the training sample wasn't representative of the deployment population. 'Distribution shift' is a sampling problem wearing an ML costume — the lesson of this page, with stakes.
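A toy version of the hospital problem, as a sketch under invented assumptions: a single lab reading separates sick from healthy patients, the "model" is just a cutoff learned at hospital A, and hospital B's equipment reads 1.5 units higher. None of these values come from real data.

```python
import random
from statistics import mean

rng = random.Random(1)

def make_patients(n, offset):
    """Return (reading, is_sick) pairs; `offset` shifts every reading at that site."""
    patients = []
    for _ in range(n):
        is_sick = rng.random() < 0.5
        center = 7.0 if is_sick else 5.0
        patients.append((rng.gauss(center, 0.7) + offset, is_sick))
    return patients

def accuracy(patients, threshold):
    """Share of patients for whom 'reading above threshold' matches the true label."""
    hits = sum((reading > threshold) == is_sick for reading, is_sick in patients)
    return hits / len(patients)

hospital_a = make_patients(5_000, offset=0.0)   # training site
hospital_b = make_patients(5_000, offset=1.5)   # deployment site, shifted readings

# "Train": put the cutoff halfway between the two class means seen at hospital A.
sick_mean = mean(reading for reading, is_sick in hospital_a if is_sick)
healthy_mean = mean(reading for reading, is_sick in hospital_a if not is_sick)
threshold = (sick_mean + healthy_mean) / 2

accuracy_a = accuracy(hospital_a, threshold)
accuracy_b = accuracy(hospital_b, threshold)
print(f"cutoff learned at A: {threshold:.2f}")
print(f"accuracy at A: {accuracy_a:.1%}")
print(f"accuracy at B: {accuracy_b:.1%}")
```

The cutoff is sound for the population it was learned from. It fails at B because B's patients were never part of the sample, which is the same gap between sample and population described above.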

Common confusion, cleared

“A bigger sample is always more trustworthy.”

Only if it's representative. A small random sample beats a huge biased one — size reduces noise, not bias. They're different diseases with different cures.
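The difference between noise and bias can be made concrete by repeating a poll many times at two sample sizes. The shares below are assumptions for the sketch: the true population share is set to 50%, and the skewed sampling frame is set to 60%.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)

TRUE_SHARE = 0.50     # hypothetical share of the whole population saying "yes"
BIASED_SHARE = 0.60   # hypothetical share within a skewed sampling frame

def estimate(share, n):
    """Poll n people from a frame where `share` say yes; return the observed share."""
    return sum(rng.random() < share for _ in range(n)) / n

def summarize(share, n, repeats=200):
    """Repeat the poll; return (average error vs. the truth, spread between polls)."""
    estimates = [estimate(share, n) for _ in range(repeats)]
    return mean(estimates) - TRUE_SHARE, stdev(estimates)

results = {}
for n in (100, 2_500):
    for label, share in (("fair", TRUE_SHARE), ("biased", BIASED_SHARE)):
        bias, noise = summarize(share, n)
        results[(label, n)] = (bias, noise)
        print(f"{label:>6} n={n:>5}: bias={bias:+.3f}  noise={noise:.3f}")
```

Growing the sample shrinks the noise column for both frames, but the biased frame's error stays near +0.10 at every size. More data cures noise; only a better sampling method cures bias.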

“Random sampling means haphazard.”

Random is rigorous: every member has a known, nonzero chance of selection, which is exactly what lets statistics quantify the uncertainty. 'Whatever was convenient' is the opposite of random, because nobody can say what chance each member had of being picked.
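A short sketch of what "known chance of selection" means for the simplest design, a simple random sample without replacement. The sampling frame of 1,000 user IDs is hypothetical, and `random.sample` stands in for whatever selection mechanism a real survey would use.

```python
import random

rng = random.Random(7)

# Hypothetical sampling frame: a complete list of the population's members.
population = [f"user_{i:04d}" for i in range(1_000)]
sample_size = 50

# Simple random sample: every member has the same, known chance of selection.
srs = rng.sample(population, sample_size)
inclusion_probability = sample_size / len(population)  # 50 / 1000 = 0.05

# Convenience sample: the first 50 on the list. Everyone else has zero chance.
convenience = population[:sample_size]

# Empirical check: how often does one particular member end up in the sample?
trials = 10_000
target = population[-1]
hits = sum(target in rng.sample(population, sample_size) for _ in range(trials))

print(f"designed inclusion probability: {inclusion_probability:.3f}")
print(f"observed over {trials:,} draws:   {hits / trials:.3f}")
print(f"target in convenience sample:   {target in convenience}")
```

Because the inclusion probability is fixed by design, the uncertainty of any estimate from `srs` can be calculated. The convenience sample offers no such handle.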

Check yourself

Practice: Quick check

An online poll on a news site finds 80% oppose a policy. The biggest concern is…

Recap

Population = what you care about; sample = what you measured.
Data type (numerical/categorical/ordinal) determines your tools.
Representativeness beats size — sampling bias can't be out-collected.
