Probability & statistics · 11 · Relationships between variables · 10 min
Hard · Correlation & regression
Two quantities that move together — study time and grades, ad spend and sales — invite two questions: how strongly are they related, and can we predict one from the other? Correlation answers the first, regression the second, and both are foundational to data science.
Build the intuition
Covariance and correlation
Covariance measures whether two variables move together: positive when they rise and fall in sync, negative when one rises as the other falls. But its size depends on the units: the covariance of height and weight changes if you switch height to inches. Correlation fixes this by dividing the covariance by the product of the two standard deviations, which gives a unitless number from −1 to +1: +1 is a perfect upward line, 0 is no linear relation, −1 is a perfect downward line. It's the standardized strength of a linear relationship.
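To see the unit problem concretely, here is a minimal sketch in plain Python. The height and weight numbers are invented for illustration, and the helper names (`covariance`, `correlation`) are ours, not part of any library.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance: summed product of deviations over n - 1."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Made-up heights and weights, for illustration only.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weights_kg = [52.0, 58.0, 69.0, 77.0, 90.0]
heights_in = [h / 2.54 for h in heights_cm]  # same people, different unit

cov_cm = covariance(heights_cm, weights_kg)
cov_in = covariance(heights_in, weights_kg)
r_cm = correlation(heights_cm, weights_kg)
r_in = correlation(heights_in, weights_kg)

print("covariance (cm, in):", cov_cm, cov_in)   # differ by a factor of 2.54
print("correlation (cm, in):", r_cm, r_in)      # identical up to rounding
```

Changing the unit rescales the covariance but leaves the correlation untouched, which is why correlation is the number to compare across datasets.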
Linear regression: the best-fit line
Regression fits a line ŷ = mx + b to predict y from x. 'Best' means least squares: choose m and b to minimize the sum of the squared vertical distances from the points to the line. Those vertical distances are the residuals. That optimization (calculus + linear algebra, met in both courses) has a clean closed-form solution. The slope tells you how much the predicted y changes per unit of x; the line becomes a prediction machine.
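For a single predictor, the closed-form solution is short enough to write out: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. A minimal sketch, with a hypothetical helper `fit_line` and invented data:

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for y ≈ m*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx            # slope = cov(x, y) / var(x)
    b = y_bar - m * x_bar    # line passes through (x_bar, y_bar)
    return m, b

# Invented data that lies close to y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

m, b = fit_line(xs, ys)
print(f"y_hat = {m:.2f} * x + {b:.2f}")
```

The n − 1 factors in the covariance and variance cancel, so the sketch works with plain sums.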
Residuals: where the model confesses
Residuals — the leftover errors after fitting — are where you check whether the line deserves trust. Patternless residuals scattered around zero mean the linear model captured the structure. A curve or fan in the residuals means it didn't (try a transform or a nonlinear model). R², the fraction of variance the line explains, summarizes the fit in one number. Always plot residuals; the headline R² can hide a bad model.
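Residuals and R² are both a few lines of arithmetic once the line is fitted. The sketch below uses invented data and hypothetical helper names; it computes R² as one minus the ratio of leftover squared error to total squared variation around the mean.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_bar - m * x_bar

def residuals(xs, ys, m, b):
    """Observed y minus predicted y, point by point."""
    return [y - (m * x + b) for x, y in zip(xs, ys)]

def r_squared(xs, ys, m, b):
    """Fraction of the variance in y that the line accounts for."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum(e ** 2 for e in residuals(xs, ys, m, b))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Invented data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

m, b = fit_line(xs, ys)
res = residuals(xs, ys, m, b)
r2 = r_squared(xs, ys, m, b)

print("residuals:", [round(e, 2) for e in res])
print("R^2:", round(r2, 4))
```

For a least-squares line with an intercept, the residuals always sum to zero, so it is their pattern against x, not their total, that carries the diagnostic information.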
Correlation is not causation
Ice-cream sales correlate with drownings — both driven by summer heat, neither causing the other. A confounding variable produces correlation without causation. Regression predicts; it does not, by itself, explain. Establishing causation needs experiments (randomization) or careful causal reasoning — a discipline this lesson opens but doesn't close.
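A small simulation makes the confounder visible. Everything below is synthetic: the temperature range, the coefficients, and the noise levels are assumptions chosen for illustration, not real data. Ice-cream sales and drownings each depend on temperature and on nothing else, yet they correlate strongly with each other. Once the part of each that temperature predicts is subtracted out, the remaining correlation is close to zero.

```python
import random
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def correlation(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def residuals_after_fit(xs, ys):
    """What is left of ys after removing its least-squares line on xs."""
    mx, my = mean(xs), mean(ys)
    m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - m * mx
    return [y - (m * x + b) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic world: temperature drives both quantities; they never touch each other.
temps = [rng.uniform(0, 35) for _ in range(2000)]
ice_cream = [10 * t + rng.gauss(0, 30) for t in temps]
drownings = [0.2 * t + rng.gauss(0, 1) for t in temps]

raw_r = correlation(ice_cream, drownings)
adjusted_r = correlation(
    residuals_after_fit(temps, ice_cream),
    residuals_after_fit(temps, drownings),
)

print("raw correlation:", round(raw_r, 3))
print("after removing temperature:", round(adjusted_r, 3))
```

This only works because the confounder was measured. With an unmeasured confounder there is nothing to subtract, which is why randomized experiments matter.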
See it move
Fit the line by hand, watch the squared error fall, then reveal the least-squares optimum. The sum of the squared dashed residuals is exactly what regression minimizes.
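The same experiment can be sketched in code: score a few hand-picked lines by their sum of squared errors, then compare against the least-squares line. The data and the guesses are invented, and the helper names are hypothetical.

```python
def sse(xs, ys, m, b):
    """Sum of squared residuals for the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_bar - m * x_bar

# Invented data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

# Hand-picked (slope, intercept) guesses, each a bit closer than the last.
guesses = [(1.0, 1.0), (1.5, 1.0), (2.0, 0.5)]
guess_errors = [sse(xs, ys, m, b) for m, b in guesses]

best_m, best_b = fit_line(xs, ys)
best_error = sse(xs, ys, best_m, best_b)

for (m, b), err in zip(guesses, guess_errors):
    print(f"guess m={m}, b={b}: squared error {err:.3f}")
print(f"least squares m={best_m:.2f}, b={best_b:.2f}: squared error {best_error:.3f}")
```

No guess can beat the least-squares line on this measure, because that line is defined as the one with the smallest squared error.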
A worked example
Read a regression result
Fitting exam score on hours studied gives ŷ = 8.1·hours + 41, with R² = 0.62.
Slope 8.1: each extra study hour predicts ~8 more points. Intercept 41: predicted score with zero study.
R² = 0.62: the line on hours studied accounts for 62% of the variance in scores. That is substantial, but the remaining 38% is left to other factors (sleep, prior knowledge, luck).
Useful for prediction. But it doesn't prove studying causes scores — though here, unlike ice cream, a causal story is plausible.
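Turning the fitted line into predictions is one line of arithmetic. The sketch below hard-codes the slope and intercept from the example above; the function name `predict_score` and the chosen hour values are ours.

```python
SLOPE = 8.1       # points per extra hour of study, from the fitted line
INTERCEPT = 41.0  # predicted score at zero hours

def predict_score(hours):
    """Predicted exam score from the fitted line y_hat = 8.1 * hours + 41."""
    return SLOPE * hours + INTERCEPT

for hours in (0, 3, 6):
    print(f"{hours} h studied -> predicted score {predict_score(hours):.1f}")
```

Predictions are most trustworthy inside the range of study hours that appeared in the data; the example does not say what that range was, so treat values far from typical study times with caution.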
Out in the world
Regression is the gateway to ML
Linear regression generalizes directly into logistic regression (classification), regularized regression (ridge/lasso), and ultimately neural networks — all 'fit parameters to minimize error on data'. Covariance matrices drive PCA, the workhorse of dimensionality reduction. Master this lesson and modern ML reads like extensions of it.
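The phrase 'fit parameters to minimize error on data' can be made literal. The sketch below swaps the closed-form solution for gradient descent, an iterative method this lesson does not cover but which is commonly used when no closed form exists. The learning rate, step count, and data are assumptions picked so that this small example converges.

```python
def mean_squared_error(xs, ys, m, b):
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent_fit(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y ≈ m*x + b by repeatedly stepping downhill on mean squared error."""
    n = len(xs)
    m, b = 0.0, 0.0  # start from a flat line through the origin
    for _ in range(steps):
        errors = [y - (m * x + b) for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to m and b.
        grad_m = -2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = -2 / n * sum(errors)
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

# Invented data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

gd_m, gd_b = gradient_descent_fit(xs, ys)
print(f"gradient descent: m={gd_m:.4f}, b={gd_b:.4f}")
print("mean squared error:", round(mean_squared_error(xs, ys, gd_m, gd_b), 5))
```

On this data the iterations settle on the same slope and intercept the least-squares formula gives. A learning rate that is too large for the data's scale would make the steps diverge instead, so the value here should not be read as a general default.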
Common confusion, cleared
“A strong correlation means one variable causes the other.”
Correlation can come from coincidence, reverse causation, or a hidden confounder. Establishing causation requires experiments or explicit causal assumptions, and skipping that step is among the most expensive mistakes in data analysis.
“A high R² means a good, trustworthy model.”
R² can be high while residuals reveal the model is systematically wrong (nonlinear, heteroscedastic). And it says nothing about causation or generalization. Always plot residuals; never trust R² alone.
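Here is a small case where R² looks excellent and the model is still wrong. The data is deliberately constructed as y = x², so a straight line is the wrong shape by design; the helper name is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_bar - m * x_bar

# Constructed curved data: y is exactly x squared, with no noise at all.
xs = [float(x) for x in range(11)]
ys = [x ** 2 for x in xs]

m, b = fit_line(xs, ys)
curve_residuals = [y - (m * x + b) for x, y in zip(xs, ys)]

y_bar = sum(ys) / len(ys)
ss_res = sum(e ** 2 for e in curve_residuals)
ss_tot = sum((y - y_bar) ** 2 for y in ys)
curve_r2 = 1 - ss_res / ss_tot

print("R^2:", round(curve_r2, 3))
print("residuals:", [round(e, 1) for e in curve_residuals])
```

R² comes out above 0.9, yet the residuals are positive at both ends and negative in the middle. That U shape is the curve the line missed, and a single summary number cannot show it.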
Check yourself
Practice · Quick check
Cities with more police have more crime — a positive correlation. The best reading is…
Regression's least-squares line minimizes…
Recap
- Correlation is standardized covariance: unitless strength of a linear relationship in [−1, 1].
- Regression fits the least-squares line — calculus and linear algebra, applied.
- Residuals diagnose the fit; correlation never proves causation.