Text Splitters, LangChain, 2024 (LangChain) - Practical guide to various text splitting strategies, including recursive splitting, for RAG applications.
NLTK Book, Chapter 3: Processing Raw Text, Steven Bird, Ewan Klein, and Edward Loper, 2009 (O'Reilly Media) - Comprehensive explanation of fundamental text processing techniques, including sentence tokenization.
Speech and Language Processing (3rd ed. draft), Daniel Jurafsky and James H. Martin, 2025 (Stanford University) - Authoritative textbook covering text segmentation, tokenization, and other preprocessing steps essential for information retrieval and natural language processing.
Retrieval-Augmented Generation for Large Language Models: A Survey, Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, Haofen Wang, 2024 (arXiv preprint arXiv:2312.10997), DOI: 10.48550/arXiv.2312.10997 - Provides a comprehensive survey of RAG, including discussion on document preprocessing strategies like chunking for effective retrieval.