Neural Machine Translation of Rare Words with Subword Units, Rico Sennrich, Barry Haddow, Alexandra Birch, 2016. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1 (Association for Computational Linguistics). DOI: 10.18653/v1/P16-1162 - Introduces the byte-pair encoding (BPE) algorithm for natural language processing, detailing how it handles rare and out-of-vocabulary words by decomposing them into subword units; directly informs the discussion of sequence length and vocabulary size.
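To make the merge procedure concrete, here is a short Python sketch closely following the toy implementation given in the paper itself; the toy vocabulary and the `num_merges` value are illustrative choices, not the settings used in the paper's experiments.

```python
import re
import collections

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the chosen pair with a merged symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy vocabulary: words pre-split into characters, '</w>' marks word end.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

num_merges = 10  # illustrative; real vocabularies use tens of thousands of merges
for _ in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)  # e.g. ('e', 's'), ('es', 't'), ... frequent subwords emerge
```

Each merge adds one new symbol to the vocabulary, so the number of merges directly controls the trade-off between vocabulary size and tokenized sequence length.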
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems 30 (NIPS 2017). DOI: 10.48550/arXiv.1706.03762 - Introduces the Transformer architecture, the foundation of modern large language models. Explains the self-attention mechanism, whose computational cost grows quadratically with sequence length ($O(L^2)$), a primary factor in the vocabulary-size trade-off.
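A minimal NumPy sketch of the paper's scaled dot-product attention, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^\top/\sqrt{d_k})V$, shows where the quadratic term arises; the sequence length and head size below are illustrative, not values from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # shape (L, L): the O(L^2) term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # shape (L, d_v)

L, d_k = 128, 64  # illustrative sequence length and per-head dimension
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(L, d_k)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (128, 64); the intermediate score matrix was 128 x 128
```

Because the $L \times L$ score matrix must be materialized (or at least computed) for every head and layer, halving the tokenized sequence length (e.g. via a larger subword vocabulary) roughly quarters this cost.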
RoBERTa: A Robustly Optimized BERT Pretraining Approach, Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov, 2019. arXiv preprint arXiv:1907.11692. DOI: 10.48550/arXiv.1907.11692 - Examines optimized pretraining strategies for BERT-style models, including the effects of a larger byte-pair encoding (BPE) vocabulary (50K tokens) than BERT's WordPiece vocabulary, demonstrating both the resulting improvements and the practical implications of vocabulary size.