As we established, Large Language Models (LLMs) operate based on the knowledge encoded during their training, making them unable to access information created afterward or private data. Retrieval Augmented Generation (RAG) provides a mechanism to bridge this gap by connecting LLMs to external data sources. The first practical step in building a RAG system is getting this external data into a format suitable for processing. This involves two main stages: loading the documents and splitting them into manageable pieces.
Before an LLM can leverage external information, that information must first be loaded into our application. Documents come in various formats and reside in different locations; common sources include PDF reports, web pages, and plain text files.
Specialized tools, often called DocumentLoaders, are typically used to handle the specifics of reading different file types or accessing data sources. These loaders abstract away the complexities of parsing various formats. For instance, a PDFLoader would handle extracting text from a PDF file, while a WebBaseLoader might fetch and parse HTML content from a URL.
Conceptually, using a loader might look something like this in Python:
# Hypothetical example - specific libraries may differ
# from some_library.document_loaders import PDFLoader
# Initialize a loader for a specific file type
# loader = PDFLoader("path/to/your_report.pdf")
# Load the document content
# documents = loader.load()
# 'documents' usually becomes a list, where each element
# represents a part of the source (e.g., a page in a PDF)
# and contains the text content along with metadata.
# print(documents[0].page_content) # The text of the first page
# print(documents[0].metadata) # {'source': 'path/to/your_report.pdf', 'page': 0}
A significant aspect of document loading is preserving metadata. Metadata refers to information about the data, such as the original filename, page number, author, creation date, or web URL. Retaining this metadata alongside the document content is important because it can be used later in the RAG process. For example, when the LLM generates an answer based on retrieved content, the metadata allows you to cite the original source (e.g., "According to page 5 of 'your_report.pdf', ...").
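To make this pattern concrete, here is a minimal from-scratch sketch that is not tied to any particular library; the Document class and load_text_file function are illustrative names. The point is simply that each loaded item pairs text content with a metadata dictionary describing where it came from.

from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Document:
    # Minimal container: the text itself plus metadata about its origin.
    page_content: str
    metadata: dict = field(default_factory=dict)

def load_text_file(path: str) -> list[Document]:
    # Read a plain text file and record its source path as metadata.
    text = Path(path).read_text(encoding="utf-8")
    return [Document(page_content=text, metadata={"source": path})]

# documents = load_text_file("notes.txt")
# print(documents[0].metadata)  # {'source': 'notes.txt'}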
Once documents are loaded, they are often too large to be processed effectively by downstream components, particularly the LLM itself. LLMs have a finite context window, which is the maximum amount of text (measured in tokens) they can consider at one time. Feeding an entire book or even a long report directly into an LLM prompt is usually infeasible.
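As a rough illustration of what "measured in tokens" means, the snippet below counts tokens using the tiktoken package and its cl100k_base encoding; this is an assumption for illustration, and the appropriate tokenizer depends on the model you use.

import tiktoken

# Encode a piece of text and count the resulting tokens.
encoding = tiktoken.get_encoding("cl100k_base")
text = "Retrieval Augmented Generation connects LLMs to external data sources."
print(len(encoding.encode(text)))
# A single sentence is only a handful of tokens; a full report or book can
# far exceed a model's context window, which is why splitting is necessary.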
Furthermore, for the "Retrieval" part of RAG to work well, we need to search over smaller, more focused pieces of text. If a user asks a specific question, retrieving an entire chapter is less helpful than retrieving the exact paragraph containing the answer.
Therefore, after loading, the next step is to split the large documents into smaller chunks, a process also known as chunking. The primary reasons for splitting are to keep each piece of text within the LLM's context window and to let retrieval return focused, relevant passages rather than entire documents.
There isn't a single perfect way to split documents; the best strategy often depends on the document type and the specific application. Here are common approaches:
Fixed-Size Chunking: The simplest method involves splitting the text into chunks of a predetermined character length (e.g., every 1000 characters). While easy to implement, this method can awkwardly cut sentences or even words in half, potentially losing semantic meaning at the boundaries.
Fixed-Size Chunking with Overlap: To mitigate the issue of cutting off context, adjacent chunks can be made to overlap. For example, a chunk might contain characters 1-1000, the next 900-1900, the next 1800-2800, and so on. This overlap (100 characters in this case) ensures that information near the chunk boundaries is present in two consecutive chunks, reducing the chance of losing context relevant to a query that spans a boundary. A short code sketch of this approach appears after this list.
Splitting a document into overlapping chunks. The overlap helps preserve context that might otherwise be lost at the split points.
Content-Aware Splitting: These methods attempt to split text based on its structure or semantic meaning. Examples include:
Separator-Based Splitting: The text is divided at explicit markers, such as paragraph breaks (\n\n), sentence-ending punctuation (., ?, !), or custom markers.
Recursive Splitting: The splitter tries a prioritized list of separators, attempting paragraph breaks (\n\n) first, then single newlines (\n), then spaces ( ). This helps keep paragraphs and sentences together as much as possible while still adhering to size limits.
Semantic Chunking: More advanced methods use NLP models (sometimes even smaller embedding models) to identify points in the text where the topic shifts, aiming to create chunks that are semantically coherent units of meaning.
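To make fixed-size chunking with overlap concrete, here is a minimal sketch; chunk_with_overlap is an illustrative helper written for this example, not a library function.

def chunk_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    # Slide a window of chunk_size characters over the text, advancing by
    # (chunk_size - overlap) so consecutive chunks share `overlap` characters.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break
    return chunks

# For a 2,500-character text this yields chunks covering characters
# 0-999, 900-1899, and 1800-2499 (the 1-1000 / 900-1900 / 1800-2800
# pattern described above, zero-indexed).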
The ideal chunk size and splitting strategy depend on several factors, most notably the type of documents you are working with and how the application will use them.
Finding the optimal chunking parameters (like size and overlap) usually involves experimentation. You might test different configurations and evaluate the quality of the retrieval results and the final LLM responses.
Frameworks often provide TextSplitter classes that implement various strategies. Using one might look conceptually like this:
# Hypothetical example - specific libraries may differ
# from some_library.text_splitter import RecursiveCharacterTextSplitter
# Assume 'documents' is the list loaded earlier
# text_splitter = RecursiveCharacterTextSplitter(
# chunk_size=1000, # Target size for each chunk
# chunk_overlap=150, # Number of characters to overlap between chunks
# separators=["\n\n", "\n", ".", " ", ""] # Order of separators to try
# )
# Split the loaded documents into smaller chunks
# chunks = text_splitter.split_documents(documents)
# 'chunks' is now a list of smaller text pieces,
# each still potentially associated with its original metadata.
# print(f"Split {len(documents)} document(s) into {len(chunks)} chunks.")
# print(chunks[0].page_content) # Content of the first chunk
# print(chunks[0].metadata) # Metadata inherited/adapted from the original document
It's important that the splitting process preserves or appropriately adapts the metadata from the original documents for each new chunk. This ensures that even after splitting, you can trace a piece of text back to its source.
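The sketch below shows one way to do this bookkeeping: each chunk inherits its parent document's metadata and gains a chunk index. The Document class and split_documents function are illustrative, reusing the simple fixed-size approach from the earlier sketch, and are not a framework API.

from dataclasses import dataclass, field

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

def split_documents(documents: list[Document], chunk_size: int = 1000, overlap: int = 150) -> list[Document]:
    # Split each document into overlapping chunks, copying the original
    # metadata onto every chunk and recording the chunk's position.
    step = chunk_size - overlap
    chunks = []
    for doc in documents:
        text = doc.page_content
        for i, start in enumerate(range(0, len(text), step)):
            piece = text[start:start + chunk_size]
            if not piece:
                break
            chunks.append(Document(
                page_content=piece,
                metadata={**doc.metadata, "chunk": i},
            ))
            if start + chunk_size >= len(text):
                break
    return chunks

# doc = Document("a long report ...", {"source": "your_report.pdf", "page": 0})
# for chunk in split_documents([doc], chunk_size=10, overlap=2):
#     print(chunk.metadata)  # {'source': 'your_report.pdf', 'page': 0, 'chunk': 0}, then 'chunk': 1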
Loading and splitting are the essential preparatory steps in a RAG pipeline. By transforming raw external data into standardized, manageable chunks with associated metadata, we set the stage for the next phase: creating vector embeddings to enable semantic search and retrieval.