Q-BERT: Quantizing BERT for Fast Inference, Sheng Shen, Zhenglun Ma, Kai Hou, Ruohui Ye, Zizheng Niu, Wenshuo Li, Feng Wu, Amir Gholami, Shunning Wei, Michael W. Mahoney, and Kurt Keutzer, 2019. Advances in Neural Information Processing Systems 32 (NeurIPS 2019). DOI: 10.48550/arXiv.1909.05846 - A foundational work that details the challenges of quantizing Transformer models, including the sensitivity of the Softmax and normalization layers, and proposes solutions that preserve accuracy.