Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - Covers mini-batch stochastic gradient descent and the trade-offs behind batch-size choice, the foundation for techniques that approximate large batches.
Language Models are Few-Shot Learners, Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei, 2020, Advances in Neural Information Processing Systems (NeurIPS) 33, DOI: 10.48550/arXiv.2005.14165 - Describes the training of GPT-3, a 175-billion-parameter language model, at a scale where memory-management techniques such as gradient accumulation are essential.
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He, 2020, International Conference for High Performance Computing, Networking, Storage and Analysis (SC), DOI: 10.48550/arXiv.1910.02054 - Presents a set of memory optimizations that partition optimizer states, gradients, and parameters across devices, complementing gradient accumulation in very large-scale model training.
Large Model Training Best Practices, Flax Developers, 2024 (Flax) - Official Flax guide detailing strategies for training large models, explicitly covering gradient accumulation as a core technique; a minimal sketch of the idea follows this list.
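For orientation, here is a minimal JAX sketch of gradient accumulation, the technique the references above discuss: gradients from several small micro-batches are summed and averaged before a single optimizer update, emulating a larger effective batch without holding it in memory. The names `loss_fn`, `accumulated_grads`, and `micro_batches` are illustrative assumptions, not taken from any of the cited works.

```python
import jax
import jax.numpy as jnp

def loss_fn(params, batch):
    # Toy squared-error loss; stands in for a real model's loss (hypothetical example).
    x, y = batch
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

grad_fn = jax.grad(loss_fn)

def accumulated_grads(params, micro_batches):
    """Sum gradients over several micro-batches, then average, so one optimizer
    step sees the same signal as a single large batch."""
    total = jax.tree_util.tree_map(jnp.zeros_like, params)
    for batch in micro_batches:
        g = grad_fn(params, batch)
        total = jax.tree_util.tree_map(jnp.add, total, g)
    return jax.tree_util.tree_map(lambda g: g / len(micro_batches), total)

# Example usage with random data: 4 micro-batches of 8 examples each,
# equivalent in expectation to one batch of 32.
key = jax.random.PRNGKey(0)
params = {"w": jnp.ones((3, 1)), "b": jnp.zeros((1,))}
micro_batches = [
    (jax.random.normal(jax.random.fold_in(key, i), (8, 3)),
     jax.random.normal(jax.random.fold_in(key, 100 + i), (8, 1)))
    for i in range(4)
]
grads = accumulated_grads(params, micro_batches)
```

In practice the accumulated gradients would then be passed to an optimizer update, and production setups (such as those described in the Flax guide) typically perform the accumulation inside a jitted training step rather than a Python loop.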