Optimization & gradient descent · 05 · Math for ML · 9 min
HardConvexity & constrained optimization
Why do some models train reliably to the best answer while others get stuck? The difference is convexity — the shape of the loss surface. And when optimization must respect limits (budgets, probabilities that sum to one), Lagrange multipliers handle the constraints with surprising elegance.
Build the intuition
Convex: one valley, no traps
A convex surface is a single smooth bowl — any local minimum is the global minimum, so gradient descent is guaranteed to find the best answer from anywhere. Linear and logistic regression have convex losses, which is why they train reliably. Convexity is the mathematical promise that 'downhill' always leads home.
Non-convex: the real, bumpy world
Neural network losses are wildly non-convex — a landscape of many valleys, ridges, and saddles. Gradient descent can settle in a local minimum that isn't the global best. Remarkably, in very high dimensions most local minima turn out nearly as good as the global one, and SGD's noise helps skip the bad ones — which is why deep learning works despite no convexity guarantee. We optimize the bumpy surface and trust the empirics.
Constraints and Lagrange multipliers
Often you must optimize subject to a rule: minimize cost with a fixed budget, maximize likelihood with probabilities summing to one. Lagrange multipliers fold the constraint into the objective, turning a constrained problem into an unconstrained one. Geometrically, the optimum sits where the objective's contours just touch the constraint surface — their gradients align. It's the math behind SVM margins and constrained ML throughout.
See it move
Flip on the non-convex landscape and start descent from different points: sometimes it reaches the global valley, sometimes a local trap. Convex problems (default) never trap — every path leads to the one minimum.
A worked example
A constrained optimum by alignment
Maximize f(x, y) = xy subject to staying on the circle x² + y² = 1 (a fixed budget).
Lagrange's condition: ∇f and ∇g point the same way, ∇f = λ∇g. This gives (y, x) = λ(2x, 2y).
Solving: x = y, and on the circle that's x = y = 1/√2.
Max value xy = ½ — found where the contours of xy just kiss the circle. The constraint and objective gradients aligned at the answer.
Out in the world
Support vector machines
An SVM finds the decision boundary with the widest margin — a constrained optimization solved with Lagrange multipliers, where the 'support vectors' are exactly the points whose constraints are active. The clean geometry of constrained optimization became one of ML's most elegant classifiers.
Common confusion, cleared
“Neural networks must be convex to work.”
They're deeply non-convex, yet train well — because high-dimensional landscapes have many good-enough minima and SGD navigates them effectively. 'No global guarantee' doesn't mean 'doesn't work'.
“Constraints just rule out some answers.”
They reshape the optimum: the best constrained solution usually differs from the unconstrained one, sitting where objective and constraint gradients align. Lagrange multipliers find that exact spot.
Check yourself
PracticeQuick check
On a convex loss surface, gradient descent…
Lagrange multipliers are used to…
Recap
- Convex losses have one global minimum — descent always finds it (regression).
- Neural nets are non-convex but train well in high dimensions with SGD.
- Lagrange multipliers solve constrained problems by aligning gradients (∇f = λ∇g).
Progress saves in this browser.