Neural Machine Translation of Rare Words with Subword Units, Rico Sennrich, Barry Haddow, Alexandra Birch, 2016, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Association for Computational Linguistics), DOI: 10.48550/arXiv.1508.07909 - Introduces Byte Pair Encoding (BPE) for subword tokenization in neural machine translation, a method conceptually similar to WordPiece.
Tokenizers in 🤗 Transformers, Hugging Face team, 2024 - Provides practical guidance and API reference for various tokenizers, including BertTokenizer and its WordPiece implementation.