Okay, we've reviewed the workhorses like SGD and Momentum, and we've touched upon the landscape of our optimization problems, noting that convexity makes life simpler, while the non-convex terrains of deep learning present significant hurdles like local minima and saddle points. But how do we rigorously compare different optimization algorithms? Simply observing if they eventually find a solution isn't enough. We need to quantify how quickly they approach a solution. This is the domain of convergence analysis.
Understanding convergence behavior is fundamental for selecting and tuning optimization algorithms effectively. It helps us answer questions like: Will this algorithm find a good solution? How many iterations or how much computation time will it likely take? Does it get stuck easily?
First, what does it mean for an algorithm, generating a sequence of iterates $x_0, x_1, x_2, \dots$, to converge? Intuitively, it means the sequence gets closer and closer to a point of interest, typically a minimizer $x^*$ of our objective function $f(x)$. More formally, we often look at one of these conditions: the iterates approach the minimizer, $\lim_{k \to \infty} ||x_k - x^*|| = 0$; the function values approach the optimal value, $\lim_{k \to \infty} f(x_k) = f(x^*)$; or the gradient norm vanishes, $\lim_{k \to \infty} ||\nabla f(x_k)|| = 0$.
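To make these criteria concrete, here is a minimal sketch (assuming a simple hand-picked quadratic objective and NumPy; the step size and iteration count are arbitrary illustration choices) that runs plain gradient descent and reports all three quantities, which shrink together as the iterates approach $x^*$.

```python
import numpy as np

# Hypothetical strongly convex test problem: f(x) = 0.5 x^T A x, minimizer x* = 0.
A = np.diag([1.0, 10.0])
x_star = np.zeros(2)
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([5.0, 5.0])
for k in range(1, 51):
    x = x - 0.05 * grad(x)                       # plain gradient descent step
    if k % 10 == 0:
        print(f"k={k:2d}  ||x_k - x*|| = {np.linalg.norm(x - x_star):.2e}  "
              f"f(x_k) - f(x*) = {f(x) - f(x_star):.2e}  "
              f"||grad f(x_k)|| = {np.linalg.norm(grad(x)):.2e}")
```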
While knowing whether an algorithm converges is important, what matters most in practice is the rate at which it converges.
The convergence rate describes how quickly the error (e.g., $f(x_k) - f(x^*)$ or $||x_k - x^*||$) decreases as the number of iterations $k$ increases. Different algorithms exhibit distinct convergence characteristics. Let $e_k$ represent the error at iteration $k$.
Sublinear convergence is generally the slowest type of convergence we encounter in useful algorithms. The error decreases, but more slowly than by any constant factor per iteration. Typical rates include $O(1/\sqrt{k})$ (e.g., SGD on general convex problems), $O(1/k)$ (gradient descent on smooth convex problems), and $O(1/k^2)$ (accelerated gradient methods).
Many stochastic methods, such as basic SGD applied to convex or non-convex functions, often exhibit sublinear convergence. Although slow in terms of iteration count, each iteration is computationally cheap, which makes these methods viable for very large datasets.
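As a rough illustration of sublinear behavior, the following sketch (a hypothetical setup: SGD with a $1/k$ step size on a simple stochastic quadratic, which amounts to estimating a mean from noisy samples) shows the squared error decaying roughly like $1/k$; the product $k \cdot e_k$ stays near a constant instead of the error shrinking geometrically.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = 3.0                      # minimizer of f(x) = E[(x - a)^2] / 2 with a ~ N(mu_true, 1)

x = 0.0
errors = []
for k in range(1, 10001):
    a = rng.normal(mu_true, 1.0)   # one stochastic sample
    g = x - a                      # stochastic gradient of 0.5 * (x - a)^2
    x -= (1.0 / k) * g             # classic 1/k step-size schedule
    errors.append((x - mu_true) ** 2)

# Squared error roughly follows O(1/k): k * error stays near a constant.
for k in (10, 100, 1000, 10000):
    print(f"k={k:5d}   (x_k - x*)^2 = {errors[k - 1]:.3e}   k * error = {k * errors[k - 1]:.3f}")
```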
Linear convergence is a much more desirable rate. The error decreases by a constant factor $\rho \in (0, 1)$ at each iteration. Mathematically, for large $k$: $\frac{e_{k+1}}{e_k} \approx \rho < 1$. This means $e_k \approx c \cdot \rho^k$ for some constant $c$. On a logarithmic scale, the error decreases along a straight line. Gradient descent applied to strongly convex and smooth functions typically achieves linear convergence. The value of $\rho$ is significant; a smaller $\rho$ (e.g., 0.1) implies much faster convergence than a $\rho$ close to 1 (e.g., 0.99).
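A simple numerical check for linear convergence is to track the ratio of successive errors. The sketch below (an assumed 2D quadratic with $\mu = 1$ and $L = 20$) runs gradient descent with step size $1/L$ and shows the ratio settling at a constant $\rho$, consistent with the classical bound $\rho \le 1 - \mu/L$ for this setting.

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T A x with eigenvalues mu = 1 and L = 20.
A = np.diag([1.0, 20.0])
L, mu = 20.0, 1.0
step = 1.0 / L                      # standard step size for an L-smooth function

x = np.array([1.0, 1.0])
prev_err = np.linalg.norm(x)
for k in range(1, 201):
    x = x - step * (A @ x)
    err = np.linalg.norm(x)
    if k % 40 == 0:
        print(f"k={k:3d}  ||x_k - x*|| = {err:.3e}   e_k / e_(k-1) = {err / prev_err:.4f}")
    prev_err = err

print("theoretical bound: rho <= 1 - mu / L =", 1 - mu / L)
```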
With superlinear convergence, the ratio of successive errors approaches zero: $\lim_{k \to \infty} \frac{e_{k+1}}{e_k} = 0$. This means the convergence accelerates over time. Quasi-Newton methods (like BFGS and L-BFGS, discussed in Chapter 2) often exhibit superlinear convergence under suitable conditions.
Quadratic convergence is an extremely fast rate. The error at the next step is proportional to the square of the current error: $\frac{e_{k+1}}{e_k^2} \approx M$ for some constant $M$. This implies that the number of correct digits in the solution roughly doubles at each iteration once the iterates are close enough to the solution. Newton's method (Chapter 2) is the classic example of an algorithm with quadratic convergence, provided certain conditions are met near the solution.
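The digit-doubling behavior is easy to observe in one dimension. The following sketch (a hypothetical example: minimizing $f(x) = x^4/4 - 2x$, whose unique minimizer is $x^* = 2^{1/3}$) applies Newton's method and prints the error at each step; the exponent of the error roughly doubles per iteration once the iterate is close to $x^*$.

```python
import numpy as np

# Minimize f(x) = x^4 / 4 - 2x; the unique minimizer is x* = 2**(1/3).
f_prime = lambda x: x**3 - 2.0
f_double_prime = lambda x: 3.0 * x**2
x_star = 2.0 ** (1.0 / 3.0)

x = 2.0                                      # starting point reasonably close to x*
for k in range(6):
    err = abs(x - x_star)
    print(f"k={k}  x_k = {x:.12f}   error = {err:.3e}")
    x = x - f_prime(x) / f_double_prime(x)   # Newton step for 1D minimization
```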
The following chart illustrates these different rates conceptually, plotting the error (on a log scale) versus the number of iterations.
Comparing hypothetical error reduction over iterations for different convergence rates. Note the dramatic speed-up from sublinear to linear, and especially to quadratic convergence, visualized on a logarithmic error scale.
Theoretical convergence rates are not universal; they depend heavily on the properties of the objective function $f$ and the specifics of the algorithm. Some important properties include:
Lipschitz Continuity of the Gradient (Smoothness): A function $f$ is $L$-smooth if its gradient does not change arbitrarily quickly. Formally, there exists a constant $L > 0$ such that: $||\nabla f(x) - \nabla f(y)|| \le L\,||x - y|| \quad \forall x, y$. Smoothness is a common assumption needed to prove convergence for many first-order methods, as it allows us to bound the function's change using the gradient. It's essential for selecting appropriate step sizes (learning rates).
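For a quadratic, $L$ is simply the largest eigenvalue of the Hessian, which gives a convenient way to sanity-check a heuristic estimate. The sketch below (an assumed 2x2 quadratic) samples random pairs of points, takes the largest observed gradient-difference ratio, and compares it with the true eigenvalue; $1/L$ is then a standard safe step size for gradient descent.

```python
import numpy as np

rng = np.random.default_rng(1)

# Quadratic f(x) = 0.5 x^T A x; for this f the smoothness constant L equals
# the largest eigenvalue of A, so the estimate can be checked against the truth.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
grad = lambda x: A @ x

ratios = []
for _ in range(10000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    ratios.append(np.linalg.norm(grad(x) - grad(y)) / np.linalg.norm(x - y))

print("empirical estimate of L :", max(ratios))
print("true L (largest eigval) :", np.linalg.eigvalsh(A).max())
# A common safe gradient descent step size is then 1 / L.
```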
Convexity: As discussed earlier, if $f$ is convex, gradient descent (with appropriate step sizes) is guaranteed to converge to the global minimum value $f(x^*)$. The rate might still be sublinear.
Strong Convexity: This is a stronger condition than simple convexity. A function $f$ is $\mu$-strongly convex if there exists a constant $\mu > 0$ such that: $f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{\mu}{2}||y - x||^2 \quad \forall x, y$. Essentially, the function has a quadratic lower bound. If $f$ is both $L$-smooth and $\mu$-strongly convex, standard gradient descent converges linearly with a rate related to the condition number $\kappa = L/\mu$.
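For quadratics the constants are explicit: $\mu$ and $L$ are the smallest and largest Hessian eigenvalues. The brief sketch below (an assumed diagonal quadratic) computes $\kappa = L/\mu$, the per-iteration contraction factor $1 - \mu/L$ for gradient descent with step size $1/L$, and checks the resulting bound empirically.

```python
import numpy as np

# f(x) = 0.5 x^T A x with Hessian A; mu and L are its extreme eigenvalues.
A = np.diag([2.0, 50.0])
eigs = np.linalg.eigvalsh(A)
mu, L = eigs.min(), eigs.max()
print(f"mu = {mu}, L = {L}, condition number kappa = {L / mu}")
print(f"per-iteration contraction factor (step 1/L): {1 - mu / L:.4f}")

# Verify empirically with gradient descent from x_0 = (1, 1).
x = np.array([1.0, 1.0])
for k in range(100):
    x = x - (1.0 / L) * (A @ x)
bound = (1 - mu / L) ** 100 * np.linalg.norm([1.0, 1.0])
print("||x_100 - x*|| =", np.linalg.norm(x), "  theoretical bound:", bound)
```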
Non-Convexity: For the non-convex functions prevalent in deep learning, guarantees are much weaker. We typically cannot guarantee convergence to a global minimum. Analysis often focuses on showing convergence to a stationary point, i.e., a point $x$ with $||\nabla f(x)|| = 0$. As we noted, these can be local minima, saddle points, or even local maxima. Recent research has shown that many algorithms, including SGD variants, are effective at escaping saddle points under certain conditions.
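A tiny one-dimensional illustration (a hypothetical double-well $f(x) = x^4 - 3x^2$, with minima at $x = \pm\sqrt{3/2}$ and a local maximum at $x = 0$): gradient descent drives the gradient norm toward zero from every start, but which stationary point it reaches depends entirely on the initialization, and starting exactly at the non-minimizing stationary point leaves it stuck there.

```python
import numpy as np

# Non-convex double-well: f(x) = x^4 - 3x^2.
# Stationary points: x = 0 (a local maximum) and x = +/- sqrt(1.5) (the two minima).
grad = lambda x: 4 * x**3 - 6 * x

for x0 in (-2.0, 1e-8, 0.0, 2.0):
    x = x0
    for _ in range(1000):
        x = x - 0.01 * grad(x)           # fixed-step gradient descent
    print(f"start x0 = {x0:<8g} -> stationary point {x:+.6f}   |grad f| = {abs(grad(x)):.2e}")
```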
While theoretical rates provide valuable insight, they don't tell the whole story: the constants hidden inside asymptotic rates, the cost of each iteration, and how well a real problem actually satisfies assumptions like smoothness or strong convexity all shape practical performance.
Understanding these fundamental concepts of convergence analysis is essential before we proceed to more advanced algorithms in subsequent chapters. We'll see how second-order methods aim for faster rates by using curvature information, how adaptive methods adjust learning rates dynamically, and how techniques for large-scale and distributed settings manage computational and communication costs while striving for efficient convergence. Analyzing the convergence properties will be a recurring theme as we evaluate each new optimization technique.