SentencePiece Implementation

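References

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, Taku Kudo, John Richardson, 2018. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (Association for Computational Linguistics). DOI: 10.18653/v1/D18-2012 - Introduces SentencePiece and details its language-independent subword tokenization design, its explicit handling of whitespace, and its integration of the BPE and Unigram models.
SentencePiece GitHub Repository and Documentation, Google, 2024 - Provides the official source code, installation instructions, command-line tool usage, Python API examples, and detailed explanations of training parameters and model types.
Tokenizers - Hugging Face Documentation, Hugging Face, 2024 - Explains the principles behind the subword tokenization algorithms used in modern NLP models, including SentencePiece, and offers practical implementation insights.
Natural Language Processing with Transformers, Lewis Tunstall, Leandro von Werra, Thomas Wolf, 2022 (O'Reilly Media) - Surveys tokenization techniques, including SentencePiece, and their role in building and applying Transformer models, covering both theory and practice.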