A New Algorithm for Data Compression, Philip Gage, 1994C/C++ Users Journal, Vol. 12 - Original paper introducing Byte Pair Encoding as a general data compression algorithm.
Neural Machine Translation of Rare Words with Subword Units, Rico Sennrich, Barry Haddow, and Alexandra Birch, 2016Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Association for Computational Linguistics)DOI: 10.18653/v1/P16-1162 - Seminal work adapting BPE for neural machine translation, leading to its widespread adoption in NLP.
Language Models are Unsupervised Multitask Learners, Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever, 2019 (OpenAI) - Introduces Byte-Level BPE (BBPE) for tokenization in the GPT-2 model, addressing challenges with arbitrary Unicode text.