Correlation & regression

Two quantities that move together — study time and grades, ad spend and sales — invite two questions: how strongly are they related, and can we predict one from the other? Correlation answers the first, regression the second, and both are foundational to data science.

Build the intuition

Covariance and correlation

Covariance measures whether two variables move together: it is positive when they rise and fall in sync and negative when one rises as the other falls. Its size depends on the units, though: the covariance of height and weight changes if you measure height in inches instead of centimetres. Correlation removes that dependence by dividing by both standard deviations, which gives a unitless value between −1 and +1: +1 is a perfect upward line, 0 is no linear relationship, and −1 is a perfect downward line. It is the standardized strength of a linear relationship.

r = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} \in [-1, 1]
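The unit dependence is easy to check numerically. The following is a minimal sketch in plain Python, assuming the population (divide by n) form of covariance; the function names and the five height and weight values are made up for illustration.

```python
import math

def covariance(xs, ys):
    """Population covariance: mean product of deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Pearson r: covariance divided by the product of the standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))  # Cov(X, X) is Var(X)
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)

# Hypothetical data: heights in centimetres, weights in kilograms.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weights_kg = [52.0, 58.0, 69.0, 77.0, 88.0]
heights_in = [h / 2.54 for h in heights_cm]  # same heights, in inches

cov_cm = covariance(heights_cm, weights_kg)
cov_in = covariance(heights_in, weights_kg)
r_cm = correlation(heights_cm, weights_kg)
r_in = correlation(heights_in, weights_kg)

print(cov_cm, cov_in)  # the covariance shrinks by a factor of 2.54
print(r_cm, r_in)      # the correlation is the same in both units
```

Changing units rescales the covariance, but the same factor appears in the standard deviation, so it cancels in r.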

Linear regression: the best-fit line

Regression fits a line ŷ = mx + b to predict y from x. 'Best' means least squares: minimize the sum of the squared residuals, where a residual is the vertical distance from a point to the line. That optimization (calculus + linear algebra, met in both courses) has a clean closed-form solution. The slope tells you how much y changes per unit of x; the line becomes a prediction machine.

\hat{y} = mx + b, \quad \min \sum (y_i - \hat{y}_i)^2
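For a single predictor, the closed-form solution is short: the slope is Cov(X, Y) divided by Var(X), and the intercept makes the line pass through the point of means. A minimal sketch, with made-up data and hypothetical function names:

```python
def fit_line(xs, ys):
    """Return (m, b) minimizing the sum of squared residuals for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx            # slope = Cov(X, Y) / Var(X)
    b = mean_y - m * mean_x  # the line passes through (mean_x, mean_y)
    return m, b

def squared_error(xs, ys, m, b):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Made-up data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

m, b = fit_line(xs, ys)
best = squared_error(xs, ys, m, b)

# Nudging the slope or intercept away from the fit only increases the error.
nudged_slope = squared_error(xs, ys, m + 0.1, b)
nudged_intercept = squared_error(xs, ys, m, b - 0.5)
print(m, b, best, nudged_slope, nudged_intercept)
```

On this data the fit comes out at a slope of about 0.6 and an intercept of about 2.2, and both nudged lines have a larger total squared error.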

Residuals: where the model confesses

Residuals — the leftover errors after fitting — are where you check whether the line deserves trust. Patternless residuals scattered around zero mean the linear model captured the structure. A curve or fan in the residuals means it didn't (try a transform or a nonlinear model). R², the fraction of variance the line explains, summarizes the fit in one number. Always plot residuals; the headline R² can hide a bad model.
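To see how a headline R² can hide a bad model, here is a small sketch that fits a straight line to data that is actually quadratic. The data (y = x² for x from 0 to 10) and the helper names are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x

def r_squared(ys, predictions):
    """Fraction of the variance in ys accounted for by the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Deliberately nonlinear data: y = x^2.
xs = [float(x) for x in range(11)]
ys = [x ** 2 for x in xs]

m, b = fit_line(xs, ys)
predictions = [m * x + b for x in xs]
residuals = [y - p for y, p in zip(ys, predictions)]
r2 = r_squared(ys, predictions)

print(round(r2, 3))                       # 0.928
print([round(e, 1) for e in residuals])
# [15.0, 6.0, -1.0, -6.0, -9.0, -10.0, -9.0, -6.0, -1.0, 6.0, 15.0]
```

An R² of about 0.93 looks strong, yet the residuals are positive at both ends and negative in the middle. That U-shape is the curve the line failed to capture.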

Correlation is not causation

Ice-cream sales correlate with drownings — both driven by summer heat, neither causing the other. A confounding variable produces correlation without causation. Regression predicts; it does not, by itself, explain. Establishing causation needs experiments (randomization) or careful causal reasoning — a discipline this lesson opens but doesn't close.
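A small simulation makes the confounder concrete. In this sketch, heat drives both ice-cream sales and drownings, and the two never influence each other. All quantities are synthetic standard-normal numbers, not real data, and the variable names are hypothetical. Removing the part of each variable that heat predicts only works here because the confounder is measured and its effect is linear.

```python
import math
import random

def correlation(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def residuals_after_fit(xs, ys):
    """What is left of ys after removing its least-squares line on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return [y - (m * x + b) for x, y in zip(xs, ys)]

random.seed(0)  # fixed seed so the run is repeatable
n = 2000
heat = [random.gauss(0, 1) for _ in range(n)]
# Each outcome depends on heat plus its own independent noise.
ice_cream = [h + random.gauss(0, 1) for h in heat]
drownings = [h + random.gauss(0, 1) for h in heat]

raw_r = correlation(ice_cream, drownings)
adjusted_r = correlation(
    residuals_after_fit(heat, ice_cream),
    residuals_after_fit(heat, drownings),
)
print(raw_r)       # clearly positive, although neither causes the other
print(adjusted_r)  # close to zero once heat is accounted for
```

With this setup the raw correlation is about 0.5 in expectation, and it comes entirely from the shared dependence on heat.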

See it move

Interactive: Fit the line yourself

Set the slope and intercept by hand, watch the total squared error fall, then reveal the least-squares optimum. The dashed residuals are your errors, and regression finds the line that makes the total of their squares as small as possible.

A worked example

Read a regression result

Fitting exam score on hours studied gives ŷ = 8.1·hours + 41, with R² = 0.62.
Slope 8.1: each extra study hour predicts ~8 more points. Intercept 41: predicted score with zero study.
R² = 0.62: hours studied accounts for 62% of the variance in scores. That is substantial, but the remaining 38% reflects other factors (sleep, prior knowledge, luck).
Useful for prediction. But it doesn't prove studying causes scores — though here, unlike ice cream, a causal story is plausible.
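Using the fitted line as a prediction machine takes one line of arithmetic. A minimal sketch using the coefficients from this example; the function name and the chosen hours are hypothetical.

```python
SLOPE = 8.1       # predicted points per extra hour of study
INTERCEPT = 41.0  # predicted score with zero study

def predict_score(hours):
    """Predicted exam score from the fitted line: 8.1 * hours + 41."""
    return SLOPE * hours + INTERCEPT

for hours in (0, 2, 5):
    print(hours, round(predict_score(hours), 1))
# 0 41.0
# 2 57.2
# 5 81.5
```

These are point predictions. With R² = 0.62, individual scores will still scatter around them, and predictions for hours far outside the range the line was fitted on deserve little trust.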

Out in the world

Regression is the gateway to ML

Linear regression generalizes directly into logistic regression (classification), regularized regression (ridge/lasso), and ultimately neural networks — all 'fit parameters to minimize error on data'. Covariance matrices drive PCA, the workhorse of dimensionality reduction. Master this lesson and modern ML reads like extensions of it.
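The phrase 'fit parameters to minimize error on data' can be made concrete with gradient descent, the iterative approach commonly used when no closed form is available. This is a rough sketch on made-up data rather than how any particular library does it; the learning rate and step count are assumptions chosen so that this small example converges.

```python
def gradient_descent_fit(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = m*x + b by repeatedly stepping downhill on mean squared error."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(m * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to m and b.
        grad_m = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

# Made-up data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

m, b = gradient_descent_fit(xs, ys)
print(m, b)  # approaches the least-squares line for this data
```

On this data the loop settles on a slope of about 0.6 and an intercept of about 2.2, which is the least-squares line. Swapping in a different model and a different error measure gives the same loop at the heart of logistic regression and neural-network training.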

Common confusion, cleared

“A strong correlation means one variable causes the other.”

Correlation can come from coincidence, reverse causation, or a hidden confounder. Establishing causation requires experiments or explicit causal assumptions, and skipping that step is one of the most expensive mistakes in data analysis.

“A high R² means a good, trustworthy model.”

R² can be high while residuals reveal the model is systematically wrong (nonlinear, heteroscedastic). And it says nothing about causation or generalization. Always plot residuals; never trust R² alone.

Check yourself

Practice: Quick check

Cities with more police have more crime — a positive correlation. The best reading is…
Regression's least-squares line minimizes…

Recap

Correlation is standardized covariance: unitless strength of a linear relationship in [−1, 1].
Regression fits the least-squares line — calculus and linear algebra, applied.
Residuals diagnose the fit; correlation never proves causation.
