Training machine learning models often involves datasets that are too large to process in a single batch. Standard gradient descent, which computes the gradient over the entire dataset at every iteration, becomes computationally infeasible at this scale. Stochastic Gradient Descent (SGD) offers a scalable alternative by using gradient estimates computed on small random samples (mini-batches), but this sampling introduces significant variance into the optimization process, which can slow convergence.
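To make the contrast concrete, here is a minimal sketch of the two update rules on a synthetic least-squares problem. The names (X, y, w, lr, batch_size) and the problem setup are illustrative assumptions, not part of this chapter's material; the point is that the mini-batch gradient is a cheap but noisy estimate of the full gradient.

```python
import numpy as np

# Hypothetical least-squares problem; sizes and names are illustrative only.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))                     # full dataset, n = 10,000 examples
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=10_000)
w = np.zeros(20)
lr = 0.01

def full_gradient(w):
    # Standard gradient descent: average the loss gradient over all n examples.
    return X.T @ (X @ w - y) / len(y)

def minibatch_gradient(w, batch_size=32):
    # SGD: an unbiased but high-variance estimate from a small random sample.
    idx = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

# One update of each flavor: the mini-batch step touches 32 examples
# instead of 10,000, but its direction fluctuates from step to step.
w_gd  = w - lr * full_gradient(w)
w_sgd = w - lr * minibatch_gradient(w)
```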
This chapter addresses the specific challenges of optimizing models with massive datasets. We will examine how the following techniques enable efficient training on data that would otherwise be intractable, with practical implementations and analysis of their convergence behavior as key components. You will learn about:
4.1 Stochastic Gradient Descent Revisited: Variance Reduction
4.2 Stochastic Average Gradient (SAG)
4.3 Stochastic Variance Reduced Gradient (SVRG)
4.4 Mini-batch Gradient Descent Trade-offs
4.5 Asynchronous Stochastic Gradient Descent
4.6 Data Parallelism Strategies
4.7 Hands-on Practical: Implementing SVRG