In-Datacenter Performance Analysis of a Tensor Processing Unit, Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, et al., 2017, ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture (ACM and IEEE Computer Society), DOI: 10.1145/3079856.3080246 - Analyzes the architecture of Google's Tensor Processing Unit (TPU), outlining its design and its performance on machine learning workloads.
FP8 Formats for Deep Learning, Paulius Micikevicius, Dusan Stosic, Patrick Judd, John Kamalu, Stuart Oberman, Mohammad Shoeybi, Michael Siu, Neil Burgess, Sangwon Ha, Richard Grisenthwaite, Naveen Mellempudi, Marius Cornea, Alexander Heinecke, Pradeep Dubey, 2022 (NVIDIA, Arm, Intel), arXiv:2209.05433 - Describes the FP8 formats (E4M3 and E5M2), their use in deep learning, and hardware support on architectures such as NVIDIA Hopper.