Neural Machine Translation of Rare Words with Subword Units, Rico Sennrich, Barry Haddow, Alexandra Birch, 2016, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Association for Computational Linguistics), DOI: 10.48550/arXiv.1508.07909 - Introduces Byte Pair Encoding (BPE) for subword tokenization in neural machine translation, a method conceptually similar to WordPiece.
Tokenizers in 🤗 Transformers, Hugging Face team, 2024 - Provides practical guidance and API reference for various tokenizers, including BertTokenizer and its WordPiece implementation.