Skip to content
LearnMathora

Linear algebra · 09 · Math for ML · 11 min

Hard

Matrix decompositions & SVD

Every matrix, no matter how messy, can be broken into simple, meaningful pieces. The Singular Value Decomposition is the most powerful of these breakdowns — and arguably the single most useful idea in applied linear algebra. It reveals the hidden structure of any data, and it underlies PCA, compression, and recommendation.

Build the intuition

Decompositions: factoring a matrix

Just as 12 = 2 × 2 × 3 reveals a number's structure, matrix decompositions factor a matrix into simpler ones with clear jobs. LU and QR factor a matrix to solve systems and least-squares quickly and stably (the engines inside numerical libraries). Cholesky factors symmetric positive-definite matrices — the covariance matrices of statistics — into a 'square root', speeding up much of computational stats. These are the workhorses; SVD is the revelation.

SVD: rotate, stretch, rotate

The SVD says every matrix A does exactly three things in sequence: rotate (Vᵀ), stretch along axes (Σ), rotate again (U). The stretch factors are the singular values σ₁ ≥ σ₂ ≥ … — ordered by importance. Geometrically, A turns the unit sphere into an ellipsoid, and the singular values are the lengths of its axes. Any transformation, however tangled, is secretly this clean three-step move.

A=UΣVTA = U\,\Sigma\,V^{\mathsf{T}}

Low-rank approximation: keep the big stretches

Here's the payoff. SVD writes A as a sum of rank-1 layers, each weighted by a singular value: A = σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + …. Because the σ's decrease fast for real data, the first few layers capture almost everything. Keep the top k, discard the rest, and you get the best possible rank-k approximation — the mathematical heart of compression and PCA. Throwing away small singular values often throws away noise, keeping signal.

Ai=1kσiuiviTA \approx \sum_{i=1}^{k} \sigma_i\, u_i v_i^{\mathsf{T}}

SVD is PCA is structure-finding

Principal Component Analysis — the most-used dimensionality reduction in data science — is SVD applied to centered data: the top singular directions are the principal components, the axes of greatest variance. The same decomposition denoises images, fills in recommender matrices, compresses data, and exposes latent factors. One factorization, a dozen famous applications — all 'find the dominant structure and drop the rest'.

See it move

InteractiveSVD low-rank approximation

Rank-2 approximation

Full data (rank 8)

singular values σ₁ … σ₈ — kept layers are solid

2
Keeping the top 2 of 8 layers captures 83.4% of the data's energy while storing 98 numbers instead of 576 (83% smaller). The biggest singular values already carry the broad structure.

SVD as ordered rank-1 layers: keep the top k and watch a data matrix rebuild from its strongest patterns first. A handful of layers captures most of the picture — that's compression, PCA, and denoising in one slider.

A worked example

Compress with three numbers per layer

  1. A 1000×1000 image is a million numbers. Its SVD has 1000 singular values — but they decay fast.

  2. Keep the top 50 layers. Each layer stores u (1000) + v (1000) + σ (1) ≈ 2001 numbers.

  3. Total:

    50×2001100,000 vs 1,000,00050 \times 2001 \approx 100{,}000 \text{ vs } 1{,}000{,}000
  4. A 10× compression, often visually indistinguishable — because the dropped singular values were tiny. SVD knew which detail mattered.

Out in the world

Latent semantics & recommendations

Run SVD on a documents×words matrix and the top singular directions become topics (latent semantic analysis); on a users×items matrix they become taste factors that predict missing ratings. The 2009 Netflix Prize was won largely with SVD-style low-rank models. The decomposition discovers meaning no one labeled.

Common confusion, cleared

SVD only works on special (square or invertible) matrices.

Every matrix has an SVD — rectangular, singular, any shape. That universality is exactly why it's the go-to tool: it never fails to reveal structure.

Low-rank approximation loses important information.

It drops the smallest singular values — usually noise and fine detail. The Eckart–Young theorem guarantees it's the best possible approximation of that rank. Often you lose noise and keep signal.

Check yourself

PracticeQuick check

  1. The singular values of a data matrix decay quickly. This tells you the data is…

  2. PCA's principal components are, essentially…

Recap

  • Decompositions factor a matrix into simple pieces; LU/QR/Cholesky solve and stabilize.
  • SVD = rotate, stretch (by singular values), rotate — structure for any matrix.
  • Keep the top-k layers for the best low-rank approximation: PCA, compression, denoising.

Progress saves in this browser.