Adam: A Method for Stochastic Optimization, Diederik P. Kingma, Jimmy Ba, 2014, International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.1412.6980 - Introduces the Adam optimizer, explaining its update rule and the memory overhead of its two per-parameter state variables (first- and second-moment estimates), a key consideration when fine-tuning large models.
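A minimal sketch of the Adam update from the paper, written here with PyTorch tensor ops (the function name adam_step is ours; the defaults match the paper's suggested hyperparameters); it makes the memory point concrete: every parameter tensor needs two extra state tensors of the same shape.

    import torch

    def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # m and v are persistent per-parameter state tensors, each the same shape as
        # param, so Adam roughly triples optimizer memory relative to plain SGD.
        m.mul_(beta1).add_(grad, alpha=1 - beta1)            # first-moment (mean) estimate
        v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second-moment (uncentered variance) estimate
        m_hat = m / (1 - beta1 ** t)                         # bias correction, t is the step count (>= 1)
        v_hat = v / (1 - beta2 ** t)
        param.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)  # in-place parameter update
        return param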
Mixed Precision Training, Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu, 2018, ICLR 2018. DOI: 10.48550/arXiv.1710.03740 - Pioneering paper on mixed precision training with FP16, FP32 master weights, and loss scaling, detailing how reduced-precision arithmetic shrinks the memory footprint and accelerates computation for deep learning models; directly relevant to the memory section.
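As a hedged illustration of the technique (not the paper's own code), the snippet below uses PyTorch's torch.cuda.amp autocast and GradScaler utilities, which implement the reduced-precision compute plus dynamic loss scaling recipe the paper describes; the tiny model and random data are placeholders, and a CUDA device is assumed.

    import torch

    model = torch.nn.Linear(512, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    scaler = torch.cuda.amp.GradScaler()            # dynamic loss scaling

    for _ in range(10):                             # dummy loop over random data
        x = torch.randn(32, 512, device="cuda")
        y = torch.randint(0, 10, (32,), device="cuda")
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():             # run the forward pass in reduced precision where safe
            loss = torch.nn.functional.cross_entropy(model(x), y)
        scaler.scale(loss).backward()               # scale the loss so small FP16 gradients do not underflow
        scaler.step(optimizer)                      # unscales gradients; skips the step on inf/nan
        scaler.update()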
Training Deep Nets with Sublinear Memory Cost, Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin, 2016, arXiv preprint arXiv:1604.06174. DOI: 10.48550/arXiv.1604.06174 - Presents gradient checkpointing, a technique to reduce activation memory during backpropagation by recomputing activations, which is crucial for training deep networks with limited GPU memory.
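A short sketch of the idea using PyTorch's stock torch.utils.checkpoint implementation of the recomputation strategy (not the authors' code); the toy stack of linear blocks is an assumption for illustration.

    import torch
    from torch.utils.checkpoint import checkpoint

    blocks = torch.nn.ModuleList(
        [torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU()) for _ in range(8)]
    )

    def forward(x):
        # Intermediate activations inside each checkpointed block are dropped after the
        # forward pass and recomputed during backprop, trading extra compute for memory.
        for block in blocks:
            x = checkpoint(block, x, use_reentrant=False)
        return x

    out = forward(torch.randn(4, 256, requires_grad=True))
    out.sum().backward()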
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He, 2020, International Conference for High Performance Computing, Networking, Storage and Analysis (SC20), IEEE. DOI: 10.1109/SC41405.2020.00008 - Introduces the ZeRO memory optimization strategy behind DeepSpeed, which partitions optimizer states, gradients, and parameters across data-parallel workers; fundamental for memory-efficient distributed training of large-scale language models.
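The sketch below is purely conceptual and is not DeepSpeed's API: it mimics ZeRO stage-1 style partitioning by giving each data-parallel rank the Adam moment tensors for only its own shard of the parameters, so per-rank optimizer-state memory falls roughly linearly with the number of ranks (partition_optimizer_state and the round-robin assignment are illustrative choices, not the paper's scheme).

    import torch

    def partition_optimizer_state(params, world_size):
        # Conceptual ZeRO stage-1 style sharding: each rank owns optimizer state
        # (Adam's m and v) for a disjoint subset of the parameters only.
        shards = [[] for _ in range(world_size)]
        for i, p in enumerate(params):
            shards[i % world_size].append(p)        # round-robin assignment, for illustration only
        state = [
            [{"m": torch.zeros_like(p), "v": torch.zeros_like(p)} for p in shard]
            for shard in shards
        ]
        return shards, state

    params = [torch.randn(1024, 1024) for _ in range(8)]
    shards, state = partition_optimizer_state(params, world_size=4)
    print(sum(s["m"].numel() for s in state[0]))    # one rank holds ~1/4 of the moment tensors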