Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - Offers extensive coverage of gradient descent variants, including batch GD, SGD, momentum, and Nesterov's accelerated gradient (NAG), which are foundational to deep learning optimization.
On the importance of initialization and momentum in deep learning, Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton, 2013, Proceedings of the 30th International Conference on Machine Learning, Vol. 28 (PMLR) - Discusses the practical application and benefits of momentum, including Nesterov's accelerated gradient, in deep learning contexts.
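Both references center on the momentum and NAG update rules; a minimal sketch of the two updates is given below. The function names, hyperparameter values, and test problem are illustrative choices of my own, not taken from either reference.

```python
import numpy as np

def momentum_step(w, v, grad_fn, lr=0.01, mu=0.9):
    # Classical momentum: accumulate a velocity from the gradient at w,
    # then move by that velocity.
    v = mu * v - lr * grad_fn(w)
    return w + v, v

def nag_step(w, v, grad_fn, lr=0.01, mu=0.9):
    # Nesterov's accelerated gradient: evaluate the gradient at the
    # look-ahead point w + mu * v before updating the velocity.
    v = mu * v - lr * grad_fn(w + mu * v)
    return w + v, v

# Toy problem: minimize f(w) = 0.5 * ||w||^2, whose gradient is simply w.
grad = lambda w: w
w, v = np.array([5.0]), np.zeros(1)
for _ in range(100):
    w, v = nag_step(w, v, grad)
```

On this quadratic the look-ahead gradient in NAG damps the oscillations that classical momentum produces, which is one of the practical benefits the Sutskever et al. paper examines.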