A Stochastic Approximation Method, Herbert Robbins and Sutton Monro, 1951, The Annals of Mathematical Statistics, Vol. 22 (Institute of Mathematical Statistics), DOI: 10.1214/aoms/1177729586 - This seminal paper introduced the stochastic approximation method, which forms the theoretical basis for Stochastic Gradient Descent, laying the groundwork for iterative optimization with noisy estimates.
Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - Chapter 8, "Optimization for Training Deep Models," provides an extensive discussion of Stochastic Gradient Descent, covering its mechanisms, advantages, and practical considerations in machine learning.
Large-scale machine learning with stochastic gradient descent, Léon Bottou, 2010, Proceedings of COMPSTAT'2010 (Physica-Verlag, Heidelberg), DOI: 10.1007/978-3-7908-2604-2_14 - This paper examines why Stochastic Gradient Descent suits large datasets, where computing time rather than sample size is the limiting factor, and discusses its computational efficiency and convergence properties.