Probability & statistics · 04 · Reading a distribution · 8 min
Easy · Shape, percentiles & spread
A single average hides almost everything interesting about data. The shape of a distribution — where it piles up, how it spreads, whether it leans — is where the real information lives. This lesson teaches you to read that shape from figures.
Build the intuition
Percentiles & quartiles: position by rank
The 90th percentile is the value below which 90% of the data falls. Quartiles cut the sorted data into four equal-count groups: Q1 (25th percentile), the median (50th), and Q3 (75th). 'Your baby is in the 60th percentile for height' means about 60% of babies are shorter: a position, not a score. Because percentiles depend on rank rather than magnitude, a few extreme values barely move them.
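As a minimal sketch of the ideas above, the snippet below computes quartiles and a percentile rank with Python's standard library. The heights are made-up illustration data, `percentile_rank` is a hypothetical helper (it counts values strictly below `x`, which is only one of several conventions), and `method="inclusive"` is just one interpolation rule; other tools may give slightly different quartiles.

```python
import statistics

# Hypothetical heights in cm for 11 babies (made-up illustration data).
heights = [61, 63, 64, 65, 66, 67, 68, 69, 70, 72, 75]

# With n=4, statistics.quantiles returns the three cut points Q1, median, Q3.
# method="inclusive" is one of several interpolation conventions.
q1, median, q3 = statistics.quantiles(heights, n=4, method="inclusive")


def percentile_rank(values, x):
    """Percent of values strictly below x (one common convention)."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


# Simulate a data-entry error: the tallest baby recorded as 175 cm.
heights_with_error = heights[:-1] + [175]
quartiles_with_error = statistics.quantiles(
    heights_with_error, n=4, method="inclusive"
)

print("Quartiles:", q1, median, q3)
print("Percentile rank of 68 cm:", round(percentile_rank(heights, 68), 1))
print("Quartiles unchanged by the bad value:",
      quartiles_with_error == [q1, median, q3])
print("Mean before and after:",
      statistics.mean(heights), statistics.mean(heights_with_error))
```

In this sample the bad value shifts the mean by several centimetres while the quartiles stay where they were, which is the rank-based resistance described above.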
The IQR and the box plot
The interquartile range, IQR = Q3 − Q1, is the width of the middle 50% of the data: a single-number spread measure that ignores extremes. A box plot draws it: a box from Q1 to Q3, a line at the median, whiskers reaching the most extreme points that still count as typical (a common convention is within 1.5 × IQR of the box), and dots for outliers beyond that. One glance reveals center, spread, skew, and anomalies.
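The numbers behind a box plot can be sketched in a few lines. This example assumes the widely used 1.5 × IQR whisker rule (plotting tools may use other rules), and both the `box_plot_summary` helper and the wait-time data are hypothetical, chosen only for illustration.

```python
import statistics


def box_plot_summary(values, k=1.5):
    """Box-plot ingredients, assuming the common k * IQR whisker rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers stop at the most extreme points still inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical wait times in minutes (made-up illustration data).
wait_times = [2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 30]
summary = box_plot_summary(wait_times)
print(summary)
```

Here the 30-minute wait lands beyond the upper fence, so it would be drawn as a dot, and the upper whisker stops at 8.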
Skewness: which way it leans
Roughly symmetric data (heights) has a centered peak with matching tails. Right-skewed data (incomes, house prices, wait times) has a long right tail: a few huge values pull the mean above the median. Left-skewed data has the long tail on the left, pulling the mean below the median. The mean-vs-median gap is a quick skewness check: a mean noticeably above the median usually signals right skew, and in that case the median is the more honest 'typical' value.
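The mean-vs-median check can be sketched as follows. `skew_hint` is a hypothetical helper, all three datasets are made up for illustration, and the comparison is a rough heuristic, not a formal skewness statistic.

```python
import statistics


def skew_hint(values):
    """Rough heuristic: compare mean and median. Not a formal skewness test."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right"
    if mean < median:
        return "left"
    return "none"


# Hypothetical samples (made-up illustration data).
incomes_k = [32, 38, 41, 45, 48, 52, 55, 60, 75, 254]  # one very high earner
heights_in = [60, 62, 64, 66, 68, 70, 72]              # evenly spread
exam_scores = [35, 78, 82, 85, 88, 90, 92]             # one very low score

for name, sample in [("incomes", incomes_k),
                     ("heights", heights_in),
                     ("exam scores", exam_scores)]:
    print(name,
          "mean:", round(statistics.mean(sample), 1),
          "median:", statistics.median(sample),
          "skew hint:", skew_hint(sample))
```

On real data, exact equality of mean and median is rare, so a practical version would likely compare the gap against some tolerance; that threshold is a judgment call and is left out of this sketch.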
See it move
Start with the symmetric bell, then carry its ±σ and percentile intuition to skewed real-world data — where mean and median part ways.
A worked example
Read a salary box plot
A company's salaries: Q1 = $52k, median = $68k, Q3 = $85k, with dots out near $400k (executives).
IQR = $85k − $52k = $33k, so the middle half of salaries spans a $33k range.
The long upper whisker and far outliers mean right skew: the mean salary is dragged well above the $68k median.
Quoting the mean here would mislead; the median $68k is the honest 'typical' salary. The box plot showed it instantly.
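The arithmetic in this example can be checked with a short script. It assumes the plot marks outliers with the common 1.5 × IQR rule, which the example does not state, and it treats $400k as the approximate position of the executive dots.

```python
# Quartiles from the worked example, in thousands of dollars.
q1, median, q3 = 52, 68, 85

iqr = q3 - q1  # width of the middle half of salaries

# Assumption: outlier dots are drawn using the common 1.5 * IQR rule.
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

executive_salary = 400  # approximate location of the dots
is_outlier = executive_salary > upper_fence

print("IQR:", iqr)
print("Fences:", lower_fence, "to", upper_fence)
print("Is $400k beyond the upper fence?", is_outlier)
```

Under that assumption the upper fence sits at $134.5k, so salaries near $400k fall far outside it, consistent with the dots in the plot.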
Out in the world
ML feature diagnostics
Before training a model, data scientists often box-plot each numeric feature: strongly skewed features may get log-transformed, outliers get investigated, and odd shapes can reveal data-quality bugs. Reading distribution shape is an early step in most serious modeling pipelines.
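To illustrate why a log transform helps with right-skewed features, the sketch below compares the mean-median gap before and after taking logs. The prices are made-up illustration data, and `mean_median_gap` is a hypothetical helper that scales the gap by the IQR so the two versions are comparable; it is a rough diagnostic, not a standard skewness statistic.

```python
import math
import statistics

# Hypothetical right-skewed feature: house prices in thousands (made up).
prices = [120, 150, 180, 200, 240, 300, 420, 900, 2500]


def mean_median_gap(values):
    """(mean - median) / IQR: a rough, scale-free hint of skew."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (statistics.mean(values) - median) / (q3 - q1)


# The natural log compresses large values much more than small ones.
# It requires strictly positive inputs.
log_prices = [math.log(p) for p in prices]

raw_gap = mean_median_gap(prices)
log_gap = mean_median_gap(log_prices)

print("Gap on raw prices:", round(raw_gap, 2))
print("Gap on log prices:", round(log_gap, 2))
```

For this sample the gap shrinks substantially after the transform, meaning the mean and median sit much closer together relative to the spread.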
Common confusion, cleared
“The 90th percentile means a score of 90%.”
It means 90% of values fall below it — a rank position. The actual value could be anything; percentile is about standing, not magnitude.
“Outliers should always be removed.”
Sometimes they're errors; sometimes they're the most important data (fraud, failures, breakthroughs). Box plots flag them for investigation, not automatic deletion.
Check yourself
Practice · Quick check
Mean = $71k but median = $58k. The distribution is…
Recap
- Percentiles and quartiles describe data by rank, resisting outliers.
- IQR = Q3 − Q1 is the middle-50% spread; box plots draw it at a glance.
- Skew shows as a mean–median gap and a long tail — trust the median when skewed.