Large Scale Distributed Deep Networks, Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Andrew Y. Ng, 2012, Advances in Neural Information Processing Systems 25 (NeurIPS 2012) - Describes the DistBelief system, which made extensive use of asynchronous SGD for training large-scale deep learning models.
Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, Feng Niu, Benjamin Recht, Christopher Ré, and Stephen J. Wright, 2011, Advances in Neural Information Processing Systems 24 (NeurIPS 2011) - Presents an early and influential lock-free parallel SGD method, demonstrating that asynchronous updates remain effective even when workers see stale gradients (a minimal illustrative sketch of this update pattern follows the list).
Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - A comprehensive textbook covering optimization methods for deep learning, including discussion of distributed training and asynchronous approaches.
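To make the asynchronous-update idea behind the Hogwild! reference concrete, here is a minimal sketch of lock-free SGD in Python. The least-squares objective, the `worker` function, and all hyperparameters are illustrative assumptions, not code from any of the papers above; Python threads are used only to show the shared, unsynchronized access pattern (the GIL limits true parallelism here).

```python
# Minimal Hogwild-style sketch (assumption: synthetic least-squares problem;
# illustrative only, not the paper's implementation).
import threading

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 2000, 20
X = rng.normal(size=(n_samples, n_features))
true_w = rng.normal(size=n_features)
y = X @ true_w + 0.01 * rng.normal(size=n_samples)

w = np.zeros(n_features)  # shared parameter vector, updated without any lock
lr = 0.01


def worker(params, indices, steps=2000):
    for t in range(steps):
        i = indices[t % len(indices)]
        # Read the (possibly stale) shared parameters, compute a stochastic
        # gradient for one example, and write the update back lock-free.
        grad = (X[i] @ params - y[i]) * X[i]
        params -= lr * grad  # in-place update on the shared array


threads = [
    threading.Thread(target=worker, args=(w, rng.permutation(n_samples)))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("parameter error:", np.linalg.norm(w - true_w))
```

The key design point the sketch illustrates is that each worker reads whatever parameter values happen to be in shared memory and applies its update without synchronization; the Hogwild! paper analyzes when such unsynchronized updates still converge.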