Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela, 2020, Advances in Neural Information Processing Systems, Vol. 33 (Neural Information Processing Systems Foundation, Inc.), DOI: 10.55917/cb.2023-1123 - Introduces the Retrieval-Augmented Generation (RAG) architecture, which is foundational for understanding the data loading discussed.
Document loaders, LangChain, 2024 (LangChain) - Official documentation covering how LangChain handles loading diverse document types and representing them as Document objects, a common approach in RAG systems.
Ingestion Pipeline, LlamaIndex, 2024 (LlamaIndex) - Official guide detailing LlamaIndex's ingestion process, showing how raw data is converted into Node objects (similar to Document objects) with extracted metadata for RAG.
Speech and Language Processing, Daniel Jurafsky and James H. Martin, 2023 (Online (3rd Edition Draft)) - Chapter 10, "Information Extraction," from a leading NLP textbook, discussing methods for extracting structured information and metadata from text, relevant to data preparation for RAG.