Speech and Language Processing (3rd edition draft), Daniel Jurafsky and James H. Martin, 2025 - A comprehensive resource covering natural language processing fundamentals, including text normalization, tokenization, and linguistic preprocessing techniques relevant for data cleaning.
Text splitters, LangChain, 2024 - Official documentation detailing various text chunking strategies and their implementations, including fixed-size, recursive, and overlap methods, offering practical guidance.
Retrieval-Augmented Generation for Large Language Models: A Survey, Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, Haofen Wang, 2024, arXiv preprint arXiv:2312.10997, DOI: 10.48550/arXiv.2312.10997 - A survey on Retrieval-Augmented Generation (RAG) that addresses the importance of data preparation and chunking for effective retrieval in LLM-based systems.