Q-BERT: Quantizing BERT for Fast Inference, Sheng Shen, Zhenglun Ma, Kai Hou, Ruohui Ye, Zizheng Niu, Wenshuo Li, Feng Wu, Amir Gholami, Shunning Wei, Michael W. Mahoney, and Kurt Keutzer, 2019. Advances in Neural Information Processing Systems 32 (NeurIPS 2019). DOI: 10.48550/arXiv.1909.05846 - A foundational work that details challenges in quantizing Transformer models, including the sensitivity of softmax and normalization layers, and proposes solutions for maintaining accuracy.