BFloat16: The Secret to High Performance on Cloud TPUs, Shibo Wang, Pankaj Kanwar, 2019 (Google Cloud Blog) - Explains BFloat16's design, why its wide dynamic range suits deep learning, and its benefits for training large models efficiently on specialized hardware.
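To make the dynamic-range point concrete, here is a minimal sketch (not taken from the blog post) that compares the three common training formats with PyTorch's torch.finfo:

```python
# Illustrative only: compare numeric ranges of FP32, BFloat16, and FP16.
# BFloat16 keeps FP32's 8 exponent bits, so its representable range matches
# FP32's; FP16's 5 exponent bits overflow far sooner (max ~65504).
import torch

for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max:.3e}, smallest normal={info.tiny:.3e}, eps={info.eps:.3e}")
```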
Mixed Precision Training, Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu, 2018 (ICLR 2018), DOI: 10.48550/arXiv.1710.03740 - Presents techniques for mixed-precision training, including FP32 master weights, FP16 computation, and loss scaling, a framework that BFloat16 training extends.
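A hand-rolled sketch of the paper's recipe (FP32 master weights, FP16 forward/backward, a static loss scale); the model, optimizer, data, and scale value are illustrative assumptions, and a CUDA device with FP16 support is assumed:

```python
# Minimal mixed-precision training step in the style of Micikevicius et al.:
# keep an FP32 master copy of the weights, run forward/backward in FP16,
# scale the loss, and apply unscaled FP32 gradients to the master weights.
import copy
import torch
import torch.nn as nn

master = nn.Linear(512, 10).cuda()            # FP32 master weights
model = copy.deepcopy(master).half()          # FP16 working copy for compute
optimizer = torch.optim.SGD(master.parameters(), lr=0.1)
loss_scale = 1024.0                           # static loss scale (illustrative)

x = torch.randn(32, 512, device="cuda", dtype=torch.float16)
y = torch.randint(0, 10, (32,), device="cuda")

loss = nn.functional.cross_entropy(model(x), y)
(loss * loss_scale).backward()                # scaled backward pass in FP16

# Unscale the FP16 gradients into the FP32 master params, then update.
for p_master, p_fp16 in zip(master.parameters(), model.parameters()):
    p_master.grad = p_fp16.grad.float() / loss_scale
optimizer.step()
optimizer.zero_grad()
model.zero_grad(set_to_none=True)

# Refresh the FP16 working copy from the updated FP32 master weights.
with torch.no_grad():
    for p_master, p_fp16 in zip(master.parameters(), model.parameters()):
        p_fp16.copy_(p_master)
```

With BFloat16 the same master-weight structure applies, but the loss-scaling step becomes largely unnecessary because BFloat16 shares FP32's exponent range.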
Automatic Mixed Precision training, PyTorch Developers, 2025 (PyTorch.org) - The official PyTorch documentation for Automatic Mixed Precision (AMP), detailing how to use torch.autocast for efficient BFloat16 training.
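A short usage sketch in the spirit of those docs (the model, optimizer, and data are placeholders, and BF16 autocast assumes hardware that supports it); note that with BFloat16, unlike FP16, no GradScaler is needed because its dynamic range matches FP32's:

```python
# Autocast the forward pass and loss to BFloat16; gradients and the
# optimizer step remain in FP32. Model/data here are illustrative.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(32, 512, device=device)
y = torch.randint(0, 10, (32,), device=device)

with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = nn.functional.cross_entropy(model(x), y)   # BF16 where safe

loss.backward()
optimizer.step()
optimizer.zero_grad()
```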
High-Performance Mixed-Precision Training for Deep Learning, Minseok Park, George K. Lee, Yunsup Lee, Michael O. Lee, 2019 (IEEE High Performance Extreme Computing Conference, HPEC), DOI: 10.1109/HPEC.2019.8916335 - Discusses the implementation and advantages of mixed-precision training, including BFloat16, within the context of hardware accelerators for deep learning.