Linear algebra · 09 · Math for ML · 11 min
HardMatrix decompositions & SVD
Every matrix, no matter how messy, can be broken into simple, meaningful pieces. The Singular Value Decomposition is the most powerful of these breakdowns — and arguably the single most useful idea in applied linear algebra. It reveals the hidden structure of any data, and it underlies PCA, compression, and recommendation.
Build the intuition
Decompositions: factoring a matrix
Just as 12 = 2 × 2 × 3 reveals a number's structure, matrix decompositions factor a matrix into simpler ones with clear jobs. LU and QR factor a matrix to solve systems and least-squares quickly and stably (the engines inside numerical libraries). Cholesky factors symmetric positive-definite matrices — the covariance matrices of statistics — into a 'square root', speeding up much of computational stats. These are the workhorses; SVD is the revelation.
SVD: rotate, stretch, rotate
The SVD says every matrix A does exactly three things in sequence: rotate (Vᵀ), stretch along axes (Σ), rotate again (U). The stretch factors are the singular values σ₁ ≥ σ₂ ≥ … — ordered by importance. Geometrically, A turns the unit sphere into an ellipsoid, and the singular values are the lengths of its axes. Any transformation, however tangled, is secretly this clean three-step move.
Low-rank approximation: keep the big stretches
Here's the payoff. SVD writes A as a sum of rank-1 layers, each weighted by a singular value: A = σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + …. Because the σ's decrease fast for real data, the first few layers capture almost everything. Keep the top k, discard the rest, and you get the best possible rank-k approximation — the mathematical heart of compression and PCA. Throwing away small singular values often throws away noise, keeping signal.
SVD is PCA is structure-finding
Principal Component Analysis — the most-used dimensionality reduction in data science — is SVD applied to centered data: the top singular directions are the principal components, the axes of greatest variance. The same decomposition denoises images, fills in recommender matrices, compresses data, and exposes latent factors. One factorization, a dozen famous applications — all 'find the dominant structure and drop the rest'.
See it move
Rank-2 approximation
Full data (rank 8)
singular values σ₁ … σ₈ — kept layers are solid
SVD as ordered rank-1 layers: keep the top k and watch a data matrix rebuild from its strongest patterns first. A handful of layers captures most of the picture — that's compression, PCA, and denoising in one slider.
A worked example
Compress with three numbers per layer
A 1000×1000 image is a million numbers. Its SVD has 1000 singular values — but they decay fast.
Keep the top 50 layers. Each layer stores u (1000) + v (1000) + σ (1) ≈ 2001 numbers.
Total:
A 10× compression, often visually indistinguishable — because the dropped singular values were tiny. SVD knew which detail mattered.
Out in the world
Latent semantics & recommendations
Run SVD on a documents×words matrix and the top singular directions become topics (latent semantic analysis); on a users×items matrix they become taste factors that predict missing ratings. The 2009 Netflix Prize was won largely with SVD-style low-rank models. The decomposition discovers meaning no one labeled.
Common confusion, cleared
“SVD only works on special (square or invertible) matrices.”
Every matrix has an SVD — rectangular, singular, any shape. That universality is exactly why it's the go-to tool: it never fails to reveal structure.
“Low-rank approximation loses important information.”
It drops the smallest singular values — usually noise and fine detail. The Eckart–Young theorem guarantees it's the best possible approximation of that rank. Often you lose noise and keep signal.
Check yourself
PracticeQuick check
The singular values of a data matrix decay quickly. This tells you the data is…
PCA's principal components are, essentially…
Recap
- Decompositions factor a matrix into simple pieces; LU/QR/Cholesky solve and stabilize.
- SVD = rotate, stretch (by singular values), rotate — structure for any matrix.
- Keep the top-k layers for the best low-rank approximation: PCA, compression, denoising.
Progress saves in this browser.