LearnMathora

Probability & statistics · 10 · Best guesses, justified · 10 min

Hard

Estimation, likelihood & MLE

Given data, what's the best estimate of the unknown that produced it? Likelihood reverses probability's question — instead of 'how likely is this data given the parameter?', it asks 'which parameter makes this data most likely?'. Maximum likelihood estimation turns that into the workhorse method of modern statistics and ML.

Build the intuition

Likelihood: probability, read backwards

Probability fixes the parameter and asks about data; likelihood fixes the data and asks about the parameter. Same formula, opposite question. If 7 of 10 trials succeeded, the likelihood of success-rate p is p⁷(1−p)³ — a function of p we can maximize. The value of p that makes the observed data most probable is the maximum likelihood estimate.

\mathcal{L}(\theta) = P(\text{data} \mid \theta)
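To make "a function of p we can maximize" concrete, here is a minimal sketch that evaluates the 7-of-10 likelihood on a grid of candidate rates and picks the largest. The function name `likelihood` and the 0.01 grid spacing are illustrative choices, not part of the lesson.

```python
def likelihood(p, successes=7, failures=3):
    """Likelihood of success rate p for a fixed record of successes and failures."""
    return p ** successes * (1 - p) ** failures

# Candidate success rates 0.00, 0.01, ..., 1.00 (grid spacing is an arbitrary choice).
grid = [i / 100 for i in range(101)]

# The data stay fixed; only the parameter p varies.
best_p = max(grid, key=likelihood)

print(best_p)                              # 0.7
print(likelihood(0.5) < likelihood(0.7))   # True: p = 0.7 explains the data better
```

A grid search only works for one or two parameters, which is why the calculus route matters for anything larger.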

Maximizing (and why we take logs)

MLE finds the peak of the likelihood, which is a calculus problem: differentiate, set to zero, solve (the optimization recipe from calculus). We maximize the log-likelihood instead, because the log turns products into sums, which are easier to differentiate and less prone to numerical underflow, and because the log is an increasing function, so its peak sits at the same place. For the 7-of-10 data, calculus gives the unsurprising answer p̂ = 0.7, but the machinery scales to models with millions of parameters.

\hat{\theta} = \arg\max_\theta \;\log \mathcal{L}(\theta)
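For the 7-of-10 data the differentiate-and-solve step can be written out in full. This is a short worked derivation using the likelihood p⁷(1−p)³ from above:

```latex
\log \mathcal{L}(p) = 7\log p + 3\log(1-p)

\frac{d}{dp}\log \mathcal{L}(p) = \frac{7}{p} - \frac{3}{1-p} = 0
\;\Longrightarrow\; 7(1-p) = 3p
\;\Longrightarrow\; \hat{p} = \frac{7}{10} = 0.7
```

The second derivative, -7/p^2 - 3/(1-p)^2, is negative everywhere on (0, 1), so this stationary point is a maximum.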

Confidence intervals: estimates with honesty

A point estimate alone hides its uncertainty. A 95% confidence interval gives a range that, under repeated sampling, captures the true value 95% of the time. Its width shrinks like 1/√n: quadrupling the data halves the range. Reporting '0.7 ± 0.09' (what 70 successes in 100 trials would give) instead of '0.7' is the difference between a number and a defensible claim.

\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
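The formula above translates directly into a few lines of code. The sketch below assumes the normal-approximation interval shown here (reasonable when n is large and p̂ is not close to 0 or 1); the function name `wald_interval` and the sample sizes are illustrative.

```python
import math

def wald_interval(successes, n, z=1.96):
    """Return (p_hat, half_width) for the normal-approximation 95% interval."""
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, half_width

# Same observed rate of 0.7 at growing sample sizes: each 4x in n halves the width.
for n in (100, 400, 1600):
    p_hat, half_width = wald_interval(7 * n // 10, n)
    print(n, p_hat, round(half_width, 3))
```

With n = 100 the half-width comes out near 0.09, and each fourfold increase in n cuts it in half, which is the 1/√n behaviour in action.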

Two ways to be wrong

In testing, a Type I error is a false alarm (rejecting a true null, 'finding' an effect that isn't there); a Type II error is a miss (failing to detect a real effect). For a fixed sample size you can't minimize both at once: demanding more evidence to avoid false alarms makes misses more likely. Significance level (α) sets your false-alarm tolerance; power (1 − β) is your chance of catching real effects. Every test is a chosen balance between these failures.
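The trade-off can be seen numerically. The sketch below assumes a one-sided z-test and a hypothetical true effect worth 2.5 standard errors; both are illustrative choices, not part of the lesson. It uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def power_one_sided_z(alpha, effect_in_se):
    """Chance of rejecting the null when the true effect is effect_in_se standard errors."""
    # Reject when the test statistic exceeds this threshold.
    critical = STANDARD_NORMAL.inv_cdf(1 - alpha)
    # Under the alternative the statistic is centred on effect_in_se with unit spread.
    return 1 - STANDARD_NORMAL.cdf(critical - effect_in_se)

# Hypothetical effect size of 2.5 standard errors, tested at stricter and stricter alpha.
for alpha in (0.10, 0.05, 0.01, 0.001):
    power = power_one_sided_z(alpha, effect_in_se=2.5)
    print(f"alpha={alpha:<6} power={power:.3f} miss rate={1 - power:.3f}")
```

As α shrinks, the rejection threshold rises, so power falls and the miss rate β climbs. When the effect is zero, the "power" is just α, the false-alarm rate.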

See it move

Interactive: The bell curve

Mean 0, spread σ = 1. Within ±1σ of the mean lives 68.3% of everything. (±1σ ≈ 68%, ±2σ ≈ 95%: the most useful rule of thumb in statistics.)

A confidence interval is a window on a distribution: widen it (more confidence) and you capture the true value more often but say less precisely where it is. That is the estimation trade-off.

A worked example

Estimate a conversion rate, honestly

  1. A new checkout converts 84 of 400 visitors. Best estimate of the true rate?

  2. MLE for a proportion is just the sample fraction: p̂ = 84/400 = 0.21.

  3. But report the uncertainty:

    0.21 \pm 1.96\sqrt{\frac{0.21 \cdot 0.79}{400}} \approx 0.21 \pm 0.04
  4. The 95% confidence interval runs from about 17% to 25%. Shipping the decision on '21%' alone hides a range wide enough to matter.
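The arithmetic behind the margin of error in step 3, written out one operation at a time (intermediate values are rounded):

```latex
\hat{p}(1-\hat{p}) = 0.21 \times 0.79 = 0.1659

\frac{0.1659}{400} \approx 0.000415, \qquad \sqrt{0.000415} \approx 0.0204

1.96 \times 0.0204 \approx 0.04

0.21 - 0.04 = 0.17, \qquad 0.21 + 0.04 = 0.25
```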

Out in the world

Training a neural network is MLE

Minimizing cross-entropy loss is equivalent to maximizing the likelihood of the training labels under the model: the loss is the negative log-likelihood, and gradient descent minimizes it. The optimization you met in calculus, the likelihood you meet here, and the linear algebra of the network all converge in a single training loop.
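The equivalence is easy to check on a toy case. The sketch below uses made-up binary labels and made-up predicted probabilities (no real model is involved) and compares the average cross-entropy with the negative log of the likelihood, divided by the number of examples.

```python
import math

# Hypothetical data: true binary labels and a model's predicted P(label = 1).
labels = [1, 0, 1, 1, 0]
predicted = [0.9, 0.2, 0.6, 0.7, 0.4]

def likelihood(labels, predicted):
    """Probability the model assigns to the observed labels, assuming independence."""
    result = 1.0
    for y, q in zip(labels, predicted):
        result *= q if y == 1 else 1 - q
    return result

def cross_entropy(labels, predicted):
    """Average binary cross-entropy loss over the examples."""
    total = 0.0
    for y, q in zip(labels, predicted):
        total += -(y * math.log(q) + (1 - y) * math.log(1 - q))
    return total / len(labels)

nll_per_example = -math.log(likelihood(labels, predicted)) / len(labels)
print(cross_entropy(labels, predicted), nll_per_example)  # the two values agree
```

Because the two quantities agree up to floating-point rounding, any parameter change that lowers the loss raises the likelihood by the corresponding amount.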

Common confusion, cleared

A 95% confidence interval means 95% chance the true value is inside this range.

Subtle but crucial: the procedure captures the truth 95% of the time across many samples. For any one computed interval, the truth is either in it or not — the 95% describes the method, not this instance.
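A simulation makes "the 95% describes the method" tangible. The sketch below assumes a true rate of 0.21 and samples of 400 (borrowed from the worked example), draws many samples, builds the normal-approximation interval for each, and counts how often the interval contains the true rate. The function names, seed, and repeat count are illustrative.

```python
import math
import random

def wald_interval(successes, n, z=1.96):
    """Normal-approximation 95% interval as (low, high)."""
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def coverage(true_p, n, repeats, seed=0):
    """Fraction of simulated intervals that contain true_p."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(repeats):
        # One simulated sample: n independent trials with success rate true_p.
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = wald_interval(successes, n)
        hits += low <= true_p <= high
    return hits / repeats

print(coverage(true_p=0.21, n=400, repeats=2000))  # close to 0.95
```

Each individual interval either contains 0.21 or it does not; only the long-run fraction sits near 95%.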

MLE is a statistics-only tool.

It's the conceptual core of supervised ML. 'Fit the model to data' almost always means 'maximize likelihood' (or minimize its negative log) — from logistic regression to language models.
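As a minimal sketch of "fit the model = minimize negative log-likelihood", the code below fits the simplest possible model, a single success rate for the 7-of-10 data, by gradient descent. Writing the rate as a sigmoid of a weight `w`, and the learning rate and step count, are illustrative assumptions; real models have many weights but follow the same loop.

```python
import math

def sigmoid(w):
    return 1 / (1 + math.exp(-w))

def fit_success_rate(successes, n, learning_rate=0.1, steps=500):
    """Minimize the negative log-likelihood of a success rate p = sigmoid(w)."""
    w = 0.0  # start from p = 0.5
    for _ in range(steps):
        p = sigmoid(w)
        # Derivative of -[k*log(p) + (n-k)*log(1-p)] with respect to w is n*p - k.
        gradient = n * p - successes
        w -= learning_rate * gradient
    return sigmoid(w)

p_hat = fit_success_rate(successes=7, n=10)
print(round(p_hat, 4))  # 0.7, matching the calculus answer
```

Gradient descent lands on the same estimate that calculus gives in closed form; the loop is what remains usable when no closed form exists.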

Check yourself

Practice: Quick check

  1. Demanding stronger evidence before declaring an effect real will…

  2. Training a classifier by minimizing cross-entropy is equivalent to…

Recap

  • Likelihood asks which parameter makes the observed data most probable.
  • MLE maximizes (log-)likelihood — an optimization, solved by calculus or gradient descent.
  • Confidence intervals attach honest uncertainty; Type I/II errors are the two ways to be wrong.
