Before a Transformer model can process text, the raw strings must be converted into a numerical format it understands. This conversion process is known as tokenization. While simpler methods exist, such as splitting text by spaces to get words, they quickly run into problems with large vocabularies and words not seen during training (out-of-vocabulary or OOV words). Transformers typically leverage more sophisticated techniques called subword tokenization algorithms.
The fundamental idea behind subword tokenization is to break down words into smaller, frequently occurring units. This approach offers a balance: it keeps the vocabulary size manageable while significantly reducing the chance of encountering unknown tokens. Instead of representing "transformer" and "transformers" as two entirely separate tokens, a subword tokenizer might represent them as ["transform", "er"] and ["transform", "ers"]. The root "transform" is learned, and common affixes like "er" and "ers" are learned separately. This allows the model to potentially understand novel combinations or variations of words based on the subword units it has learned.
Let's explore the most common subword tokenization algorithms used with Transformer models:
Originally a data compression algorithm, Byte Pair Encoding (BPE) was adapted for text tokenization. It works iteratively: start with a vocabulary of individual characters, count how often each adjacent pair of symbols occurs in the training corpus, merge the most frequent pair into a single new symbol, and repeat until the vocabulary reaches a target size.
Consider a simplified example with a tiny corpus and a few merges:

Corpus: {"low low low", "lowest lowest", "newer newer", "wider wider"}
Initial vocabulary (individual characters): {'l', 'o', 'w', ' ', 's', 't', 'n', 'e', 'r', 'i', 'd'}

1. Count pairs: pairs such as ('l', 'o'), ('o', 'w'), ('w', ' '), and ('e', 'r') appear frequently. Let's say ('e', 'r') is the most frequent.
2. Merge ('e', 'r') into the new symbol 'er'. The symbols in use become {'l', 'o', 'w', ' ', 's', 't', 'n', 'e', 'er', 'i', 'd'}, and the corpus is conceptually segmented as {"low low low", "lowest lowest", "new'er' new'er'", "wid'er' wid'er'"}.
3. Count pairs again, now including pairs such as ('w', 'er') or (' ', 'er'). Maybe ('l', 'o') is now the most frequent.
4. Merge ('l', 'o') into 'lo'. The symbols in use become {'lo', 'w', ' ', 's', 't', 'n', 'e', 'er', 'i', 'd'}, and the corpus becomes {"'lo'w 'lo'w 'lo'w", "'lo'west 'lo'west", "new'er' new'er'", "wid'er' wid'er'"}.
5. Count pairs again: ('lo', 'w') now becomes frequent.
6. Merge ('lo', 'w') into 'low'. The symbols in use become {'low', ' ', 's', 't', 'n', 'er', 'i', 'd', 'e', 'w'} (note that single characters like 'e' and 'w' are still needed, for example in "lowest" and "newer"). Corpus: {"'low' 'low' 'low'", "'low'est 'low'est", "new'er' new'er'", "wid'er' wid'er'"}.

This process continues, building up common subwords like "lowest", "newer", and "wider", or potentially stopping earlier depending on the desired vocabulary size. (In an actual BPE implementation the vocabulary only grows: base characters are kept and each merge adds a new symbol; the sets above show only the symbols currently used to segment the corpus.)
A simplified visualization of BPE merging steps for the word "newer". Initial characters are combined based on frequency to form subwords like "er", potentially "new" (assuming 'n','e' merged), and finally "newer".
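To make the count-and-merge loop concrete, here is a minimal, illustrative Python sketch of the procedure described above. It is a simplification that operates on a small hand-built word-frequency table and ignores efficiency; it is not the implementation used by any particular library, and the merges it prints may differ from the walk-through above depending on exact frequencies and tie-breaking.
from collections import Counter

def get_pair_counts(corpus):
    # Count adjacent symbol pairs across all words, weighted by word frequency
    counts = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(corpus, pair):
    # Replace every occurrence of the given adjacent pair with a single merged symbol
    merged = {}
    for symbols, freq in corpus.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        key = tuple(new_symbols)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word is a tuple of characters with its frequency
corpus = {
    tuple("low"): 3, tuple("lowest"): 2,
    tuple("newer"): 2, tuple("wider"): 2,
}

num_merges = 3  # a small number of merges for illustration
for _ in range(num_merges):
    pair_counts = get_pair_counts(corpus)
    best_pair = max(pair_counts, key=pair_counts.get)
    corpus = merge_pair(corpus, best_pair)
    print("Merged:", best_pair)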
WordPiece is conceptually similar to BPE but uses a different criterion for merging. Instead of merging the most frequent pair, WordPiece merges the pair that maximizes the likelihood of the training data, given the vocabulary. It essentially asks: "Which merge makes the training data most probable under a simple language model defined by the vocabulary?". WordPiece is notably used by BERT and related models. It often results in subwords that align well with linguistic morphemes, although that's not its explicit goal. A common convention for WordPiece is to prefix subwords that continue a word with ## (e.g., "transformers" might become ["transform", "##er", "##s"]).
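For illustration, the Hugging Face tokenizers library also ships a WordPiece model and trainer, used much like the BPE example later in this section. This is a minimal sketch with a placeholder corpus path; the printed tokens in the comment are only a plausible example, since the actual subwords depend on the trained vocabulary.
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build a WordPiece tokenizer; "##" marks subwords that continue a word
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=10000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(["path/to/your/corpus.txt"], trainer)  # placeholder path

output = tokenizer.encode("transformers")
print(output.tokens)  # e.g. ['transform', '##er', '##s'], depending on the learned vocabulary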
SentencePiece treats the input text as a raw sequence, including whitespace. Unlike BPE and WordPiece, which often require pre-tokenization (like splitting by spaces), SentencePiece operates directly on the raw byte stream or Unicode characters. It encodes whitespace explicitly, often using a special character like '▁' (U+2581) to represent a space within a token. This makes it particularly effective for languages where word boundaries are not clearly defined by spaces, and it allows a single, consistent tokenization/detokenization process across different languages without language-specific logic.
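As a point of comparison, here is a minimal, illustrative sketch using the standalone sentencepiece package (a separate library from Hugging Face tokenizers). The corpus path, model prefix, and vocabulary size are placeholder values, and the tokens shown in the comment are only an example of what the output might look like.
import sentencepiece as spm

# Train a SentencePiece model on a raw text file (placeholder path and settings)
spm.SentencePieceTrainer.train(
    input="path/to/your/corpus.txt",
    model_prefix="spm_example",
    vocab_size=8000,
)

# Load the trained model and tokenize raw text; '▁' marks pieces that begin a word
sp = spm.SentencePieceProcessor(model_file="spm_example.model")
pieces = sp.encode("This is example text.", out_type=str)
print(pieces)  # e.g. ['▁This', '▁is', '▁example', '▁text', '.'], depending on the learned vocabulary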
Beyond subwords derived from the data, Transformer tokenizers include several special tokens essential for model operation:
[PAD] (Padding Token): Used to make sequences in a batch the same length. The model learns to ignore these via the attention mask.
[UNK] (Unknown Token): Represents any subword not present in the tokenizer's vocabulary. Ideally, subword tokenization minimizes the occurrence of [UNK].
[CLS] (Classification Token): Often added to the beginning of an input sequence. The final hidden state corresponding to this token is frequently used as the aggregate sequence representation for classification tasks.
[SEP] (Separator Token): Used to separate distinct segments of text within a single input sequence (e.g., separating question and context in question answering, or two sentences for next-sentence prediction).
[MASK] (Mask Token): Used specifically during masked language model pre-training (like in BERT), where input tokens are randomly replaced with [MASK], and the model learns to predict the original tokens.
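To see several of these tokens in an actual encoding, here is a short illustrative snippet. It assumes the Hugging Face transformers package is installed and downloads the pre-trained bert-base-uncased tokenizer on first use; the example sentences are arbitrary.
from transformers import AutoTokenizer

# Load a pre-trained WordPiece-based tokenizer (BERT)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A question/context pair: [CLS] is prepended, [SEP] separates and ends the segments
pair = tokenizer("What is tokenization?", "It converts text into tokens.")
print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))

# Batching two sentences of different lengths: the shorter one is filled with [PAD],
# and the attention mask marks those padding positions with 0
batch = tokenizer(["A short sentence.", "A somewhat longer example sentence."], padding=True)
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0]))
print(batch["attention_mask"][0])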
Implementing these tokenization algorithms from scratch can be complex. Fortunately, excellent libraries handle this efficiently. The Hugging Face tokenizers library provides highly optimized implementations of BPE, WordPiece, and others, allowing you to easily load pre-trained tokenizers or train your own on a specific corpus.
# Example using Hugging Face tokenizers (conceptual)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
# 1. Initialize a tokenizer with a BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# 2. Customize pre-tokenization (e.g., split by whitespace first)
tokenizer.pre_tokenizer = Whitespace()
# 3. Define a trainer (vocab size, special tokens)
trainer = BpeTrainer(vocab_size=10000, special_tokens=[
"[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"
])
# 4. Train on your text files
files = ["path/to/your/corpus.txt", ...] # List of text files
tokenizer.train(files, trainer)
# 5. Save the tokenizer
tokenizer.save("my-bpe-tokenizer.json")
# 6. Load and use
tokenizer = Tokenizer.from_file("my-bpe-tokenizer.json")
output = tokenizer.encode("This is example text.")
print(f"Tokens: {output.tokens}")
# Example output (tokens depend on your training corpus), e.g.:
# Tokens: ['This', 'is', 'example', 'text', '.']
# (Note: with the Whitespace pre-tokenizer used here, no explicit space marker appears.
#  Byte-level BPE tokenizers such as GPT-2's mark a preceding space with 'Ġ',
#  while SentencePiece uses '▁'.)
print(f"IDs: {output.ids}")
# Example output (IDs depend on your vocabulary), e.g.: IDs: [713, 164, 1794, 1036, 5]
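The exact tokens and IDs above are illustrative; they depend on the trained vocabulary. The tokenizers library can also map IDs back to text, which is useful for inspecting model inputs and outputs. A minimal usage sketch:
# Map token IDs back to a string; special tokens such as [PAD] can be skipped
decoded = tokenizer.decode(output.ids, skip_special_tokens=True)
print(decoded)
# Example output: "This is example text ." (exact spacing depends on the decoder configured for the tokenizer)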
Understanding tokenization is the first practical step in preparing your data for a Transformer. The choice of algorithm and vocabulary size impacts how the model "sees" the text, influencing its performance and ability to generalize. Once text is converted into these sequences of integer IDs (like output.ids above), you can proceed to create batches suitable for training, which we'll discuss next.