Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela, 2020. Advances in Neural Information Processing Systems, Vol. 33 (Curran Associates, Inc.) - Establishes the core Retrieval-Augmented Generation (RAG) architecture, which depends on effective document preprocessing such as chunking for efficient retrieval.
Natural Language Processing with Transformers, Lewis Tunstall, Leandro von Werra, Thomas Wolf, 2022 (O'Reilly Media) - Offers a thorough explanation of tokenization techniques, including BPE and WordPiece, essential for understanding token-based chunking in LLMs.
LangChain Text Splitters Documentation, Harrison Chase, 2023 - Documents various text splitting strategies, including fixed-size chunking with overlap, as implemented in a widely used RAG framework (a minimal code sketch follows this list).
Retrieval-Augmented Generation for Large Language Models: A Survey, Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, Haofen Wang, 2024. arXiv preprint arXiv:2312.10997, DOI: 10.48550/arXiv.2312.10997 - Surveys Retrieval-Augmented Generation (RAG) methods, including discussion of preprocessing steps such as chunking and their influence on retrieval and generation quality.
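
To illustrate the fixed-size chunking with overlap described in the LangChain entry above, here is a minimal sketch. It assumes the langchain-text-splitters package and its CharacterTextSplitter class; the chunk size and overlap values are illustrative, not recommendations.

```python
# A minimal sketch of fixed-size chunking with overlap, assuming the
# langchain-text-splitters package (import paths and defaults may
# differ across LangChain versions).
from langchain_text_splitters import CharacterTextSplitter

sample_text = (
    "Retrieval-augmented generation pipelines typically split source "
    "documents into smaller chunks before embedding and indexing them. "
    "Overlapping chunks help preserve context that would otherwise be "
    "cut off at chunk boundaries."
)

# Split on spaces and merge words into ~100-character chunks,
# with roughly 20 characters of overlap between consecutive chunks.
splitter = CharacterTextSplitter(
    separator=" ",
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
)

chunks = splitter.split_text(sample_text)
for i, chunk in enumerate(chunks):
    print(f"chunk {i} ({len(chunk)} chars): {chunk!r}")
```

The overlap is what distinguishes this strategy from plain fixed-size splitting: each chunk repeats the tail of the previous one so that sentences straddling a boundary remain retrievable from at least one chunk.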