Probability & statistics · 02 · What are we even measuring? · 7 min
Data, populations & samples
Before any formula: what is your data, and what does it represent? Almost every statistical error traces back to confusing the sample you measured with the population you care about. Get this distinction right and half of statistics becomes common sense.
Build the intuition
Population vs sample
The population is everyone (or everything) you want to conclude about — all voters, all possible users, all manufactured bolts. The sample is the slice you actually measured. Statistics is the art of saying something trustworthy about the population from only the sample, with the gap between them honestly accounted for.
Types of data shape your tools
Numerical data (height, price, temperature) supports means and spreads. Categorical data (color, yes/no, country) supports counts and proportions. Ordinal data (ratings, ranks) sits between: the categories have an order, but the gaps between them aren't guaranteed to be equal, so medians and counts are safer than means. Picking the right summary and test starts with correctly naming what kind of data you hold.
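To make the pairing of data type and summary concrete, here is a minimal Python sketch. The toy data and the three `summarize_*` helper names are invented for illustration, and it uses only the standard library.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical toy data, invented for illustration.
heights_cm = [158.0, 164.5, 171.0, 171.0, 180.5]   # numerical
countries = ["PT", "BR", "PT", "IN", "BR", "PT"]    # categorical
ratings = [1, 3, 4, 4, 5, 5, 5]                     # ordinal (1-5 stars)

def summarize_numerical(values):
    """Mean and sample standard deviation suit numerical data."""
    return {"mean": mean(values), "stdev": stdev(values)}

def summarize_categorical(values):
    """Proportions per category are the natural summary for labels."""
    counts = Counter(values)
    total = len(values)
    return {label: count / total for label, count in counts.items()}

def summarize_ordinal(values):
    """Order matters but spacing may not, so report the median and counts."""
    counts = dict(sorted(Counter(values).items()))
    return {"median": median(values), "counts": counts}

print(summarize_numerical(heights_cm))
print(summarize_categorical(countries))
print(summarize_ordinal(ratings))
```

Averaging the country codes would be meaningless, and averaging star ratings quietly assumes the step from 1 to 2 stars equals the step from 4 to 5. Naming the data type first is what rules those summaries in or out.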
Sampling bias: the silent killer
A sample only speaks for the population if it's representative. Survey gym members about exercise and you'll badly overestimate how much the nation exercises. The 1936 Literary Digest poll collected about 2.4 million responses and still called the election wrong: its mailing lists leaned heavily on car and phone owners during the Depression, a wealthier slice of the electorate. Bias beats sample size every time: a biased mountain of data lies louder than a fair handful.
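The gym example can be simulated in a few lines. This is a rough sketch under invented assumptions: a population of 100,000 people, about 15% of them gym members, and made-up exercise hours for each group. It compares a huge members-only survey with a tiny random one.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100,000 people and their weekly exercise hours.
# Roughly 15% are gym members who exercise more. All numbers are invented.
population = []
for _ in range(100_000):
    is_member = rng.random() < 0.15
    typical_hours = 5.0 if is_member else 1.5
    hours = max(0.0, rng.gauss(typical_hours, 1.0))
    population.append((is_member, hours))

true_mean = mean(hours for _, hours in population)

# Huge but biased: 10,000 responses, all from gym members.
member_hours = [hours for is_member, hours in population if is_member]
biased_estimate = mean(rng.sample(member_hours, 10_000))

# Small but fair: 100 people drawn at random from the whole population.
fair_estimate = mean(hours for _, hours in rng.sample(population, 100))

print(f"true mean:           {true_mean:.2f} h/week")
print(f"10,000 gym members:  {biased_estimate:.2f} h/week")
print(f"100 random people:   {fair_estimate:.2f} h/week")
```

Under these assumptions the 100-person random sample should land far closer to the true mean than the 10,000-person gym survey, because extra members-only responses only sharpen the estimate of the wrong quantity.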
See it move
Each coin flip is one draw from the 'population' of all possible flips. Watch how the sample average approaches the true population value as flips accumulate, but only because the sampling is fair.
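The same experiment can be run offline. The sketch below is an illustration, not the page's own widget: `running_heads_fraction` and its `keep_tails_prob` parameter are hypothetical names. It tracks the running fraction of heads for a fair coin, and optionally drops some tails before they are recorded to mimic unfair sampling.

```python
import random

def running_heads_fraction(n_recorded, keep_tails_prob=1.0, seed=0):
    """Flip a fair coin until n_recorded flips have been recorded.

    Heads are always recorded; tails are recorded with probability
    keep_tails_prob. A value of 1.0 is fair sampling; anything lower
    silently drops some tails, which biases the sample.
    """
    rng = random.Random(seed)
    heads = 0
    fractions = []
    while len(fractions) < n_recorded:
        is_heads = rng.random() < 0.5
        if not is_heads and rng.random() >= keep_tails_prob:
            continue  # this tail never makes it into the sample
        heads += is_heads
        recorded = len(fractions) + 1
        fractions.append(heads / recorded)
    return fractions

fair = running_heads_fraction(10_000)
unfair = running_heads_fraction(10_000, keep_tails_prob=0.5)
for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} flips  fair: {fair[n - 1]:.3f}  unfair: {unfair[n - 1]:.3f}")
```

With fair recording the fraction settles near 0.5. When half the tails go missing it settles near 2/3 instead, and more flips only make that wrong answer look more stable.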
A worked example
Spot the sampling flaw
An app shows a 4.8-star average. Is that a trustworthy measure of user satisfaction?
Who leaves reviews? Mostly the delighted and the furious — the quiet middle is missing. The sample is self-selected.
The 4.8 describes reviewers, not users. To learn about all users, you'd need to sample them directly — a prompt to a random subset, say.
The number was real; the population it represented was the trap.
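A small simulation shows how self-selection can inflate a star average. Everything in this sketch is assumed for illustration: the share of users at each rating, the chance that each kind of user leaves a review, and the simplification that every prompted user answers.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical user base: true satisfaction in stars, with invented shares.
stars = [1, 2, 3, 4, 5]
share_of_users = [0.05, 0.10, 0.30, 0.35, 0.20]
users = rng.choices(stars, weights=share_of_users, k=50_000)

# Assumed chance that a user with a given rating leaves a review:
# the furious and the delighted speak up, the quiet middle mostly does not.
review_chance = {1: 0.10, 2: 0.02, 3: 0.01, 4: 0.03, 5: 0.30}

reviews = [s for s in users if rng.random() < review_chance[s]]
prompted = rng.sample(users, 500)  # random subset asked directly

print(f"all users (unknown in practice): {mean(users):.2f}")
print(f"self-selected reviews:           {mean(reviews):.2f}")
print(f"random prompt of 500 users:      {mean(prompted):.2f}")
```

Under these assumptions the review average sits well above the true average, while the 500 randomly prompted users track it closely. In practice some prompted users won't answer, which is a sampling problem of its own.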
Out in the world
Why ML models fail in the wild
A model trained on data from one hospital often fails at another: the training sample wasn't representative of the deployment population. 'Distribution shift' is a sampling problem wearing an ML costume — the lesson of this page, with stakes.
Common confusion, cleared
“A bigger sample is always more trustworthy.”
Only if it's representative. A small random sample beats a huge biased one — size reduces noise, not bias. They're different diseases with different cures.
“Random sampling means haphazard.”
Random is rigorous: every member has a known, nonzero chance of selection, which is exactly what lets statistics quantify the uncertainty. 'Whatever was convenient' is the opposite of random.
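In code the contrast is short. This sketch assumes a hypothetical sampling frame of 1,000 user IDs ordered by sign-up date; `simple_random_sample` and `convenience_sample` are illustrative names, built on the standard library's `random.Random.sample`.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n members without replacement.

    Every member of the frame has the same known inclusion chance,
    n / len(frame).
    """
    return random.Random(seed).sample(frame, n)

def convenience_sample(frame, n):
    """Take whoever is first in line: everyone past position n has zero chance."""
    return frame[:n]

# Hypothetical sampling frame: user IDs 0..999, ordered by sign-up date.
frame = list(range(1000))

print(simple_random_sample(frame, 5, seed=42))  # scattered across the frame
print(convenience_sample(frame, 5))             # earliest sign-ups only
```

Because the inclusion chance is known (here 5 in 1,000 for every user), the uncertainty of any estimate built on the random sample can be worked out. For the convenience sample, nothing can be said about the 995 users who never had a chance of being picked.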
Check yourself
Quick check
An online poll on a news site finds 80% oppose a policy. The biggest concern is…
Recap
- Population = what you care about; sample = what you measured.
- Data type (numerical/categorical/ordinal) determines your tools.
- Representativeness beats size — sampling bias can't be out-collected.