Okay, let's put the theory of document chunking into practice. We've discussed why splitting large documents is necessary for RAG systems – primarily due to context window limitations of LLMs and the need for focused retrieval. Now, we'll use Python and common libraries to load and chunk text data.
We'll assume you have a working Python environment and have installed the necessary libraries. For these examples, we'll primarily use components from the LangChain library, which provides helpful abstractions for common RAG tasks. If you haven't installed it, you can typically do so via pip:
pip install langchain langchain_community
First, let's define some sample text content that we can work with. Imagine this text comes from a larger document about the fictional "AstroCortex" project.
# Sample document content
astro_cortex_text = """
Project AstroCortex: An Overview
AstroCortex is a groundbreaking initiative aimed at mapping the neural pathways of cosmic entities. Our primary objective is to understand how these beings perceive and interact with the universe's fundamental forces. The project integrates quantum physics, astrobiology, and computational neuroscience.
Methodology involves deploying hyperspace probes equipped with neural scanners. These probes collect data signatures, which are then transmitted back to our Earth-based Deep Thought Cluster for analysis. Initial data suggests complex, non-linear thought patterns vastly different from terrestrial life. We are using advanced machine learning models to decode these patterns. Challenges include signal degradation over vast distances and the unique nature of the data, requiring novel analytical techniques.
Ethical considerations are significant. Communication attempts are strictly non-invasive, adhering to the Prime Directive Protocols established in 2267. Ensuring the autonomy and safety of any discovered intelligence is our highest priority. The potential benefits include unparalleled insights into the universe's structure and our place within it. Funding is provided by the Interstellar Science Foundation and various private benefactors. The project timeline spans several decades, reflecting the complexity and scale of the undertaking. Future phases involve attempts at establishing rudimentary communication based on mathematical principles.
"""
# In a real scenario, you might load this from a file.
# For simplicity, we'll work with this string directly.
# If loading from a .txt file named 'astrocortex.txt':
# from langchain_community.document_loaders import TextLoader
# loader = TextLoader('astrocortex.txt')
# documents = loader.load() # This returns a list of Document objects
# astro_cortex_text = documents[0].page_content # Extract text if loaded from file
The most straightforward approach is to split the text into chunks of a fixed character length. This method is simple but can sometimes awkwardly split sentences or ideas. We'll use LangChain's CharacterTextSplitter for this. An important parameter here is chunk_overlap, which defines how many characters from the end of one chunk are repeated at the beginning of the next. This overlap helps maintain context across chunk boundaries.
from langchain.text_splitter import CharacterTextSplitter
# Initialize the splitter
char_splitter = CharacterTextSplitter(
    separator="\n\n",         # How to first try splitting text (e.g., by paragraph)
    chunk_size=250,           # Desired maximum chunk size (in characters)
    chunk_overlap=50,         # Number of characters to overlap between chunks
    length_function=len,      # Function to measure chunk length (default is fine)
    is_separator_regex=False, # Treat separator as a literal string
)
# Split the text
fixed_chunks = char_splitter.split_text(astro_cortex_text)
# Let's examine the chunks
print(f"Original text length: {len(astro_cortex_text)}")
print(f"Number of chunks created: {len(fixed_chunks)}\n")
for i, chunk in enumerate(fixed_chunks):
    print(f"--- Chunk {i+1} (Length: {len(chunk)}) ---")
    print(chunk)
    print("-" * 20 + "\n")
Running this code will output the chunks. Notice how chunk_size acts as a maximum limit, and the splitter tries to respect the separator first. Pay attention to the end of one chunk and the beginning of the next to see the chunk_overlap in action. For instance, the end of Chunk 1 might reappear at the start of Chunk 2, providing continuity.
While simple, fixed-size chunking might cut a sentence mid-thought if a natural breaking point (like a paragraph) isn't found near the chunk_size limit.
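To see the overlap concretely, a small pure-Python helper can locate the shared text between two consecutive chunks. Note that shared_overlap is a hypothetical helper written for this illustration, not part of LangChain; with real data you would pass it fixed_chunks[i] and fixed_chunks[i + 1].

```python
def shared_overlap(prev_chunk: str, next_chunk: str, max_len: int = 50) -> str:
    """Return the longest suffix of prev_chunk that is also a prefix of next_chunk."""
    limit = min(max_len, len(prev_chunk), len(next_chunk))
    for size in range(limit, 0, -1):
        if prev_chunk.endswith(next_chunk[:size]):
            return next_chunk[:size]
    return ""

# Toy example standing in for two adjacent chunks
a = "Probes collect data signatures for analysis."
b = "signatures for analysis. Initial data suggests complex patterns."
print(repr(shared_overlap(a, b)))  # → 'signatures for analysis.'
```

An empty result means the splitter found no repeated region between the two chunks, which can happen when a separator boundary falls exactly at the chunk edge.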
A more commonly used and often more effective strategy is recursive character splitting. This method attempts to split text based on a prioritized list of separators. It starts with larger semantic units (like double newlines for paragraphs) and recursively splits smaller pieces if a chunk is still too large, using progressively smaller separators (single newline, space, etc.). This approach tends to keep related content together more effectively than simple fixed-size splitting. LangChain's RecursiveCharacterTextSplitter implements this.
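To make the algorithm concrete, here is a minimal pure-Python sketch of the recursive idea. It is deliberately simplified (no chunk_overlap, naive merging) and is illustrative only, not LangChain's actual implementation:

```python
def recursive_split(text, separators, chunk_size):
    """Sketch: split on the first separator; recurse on oversized pieces
    with the remaining separators; merge small pieces back up to chunk_size."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)
    chunks, current = [], ""
    for piece in pieces:
        if len(piece) > chunk_size:
            if current:                      # flush what we have merged so far
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, rest, chunk_size))
        elif not current:
            current = piece
        elif len(current) + len(sep) + len(piece) <= chunk_size:
            current += sep + piece           # merge small pieces together
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

text = "First paragraph here.\n\nSecond paragraph is a little longer than the first one."
for c in recursive_split(text, ["\n\n", "\n", " ", ""], 40):
    print(repr(c))
```

Notice how the first paragraph survives intact while the oversized second paragraph falls through to the space separator, mirroring the prioritized-separator behavior described above.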
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Initialize the recursive splitter
recursive_splitter = RecursiveCharacterTextSplitter(
    # Try splitting first by paragraphs, then lines, then words
    separators=["\n\n", "\n", " ", ""],
    chunk_size=250,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False,
)
# Split the text
recursive_chunks = recursive_splitter.split_text(astro_cortex_text)
# Examine the results
print(f"Original text length: {len(astro_cortex_text)}")
print(f"Number of chunks created by recursive splitter: {len(recursive_chunks)}\n")
for i, chunk in enumerate(recursive_chunks):
    print(f"--- Recursive Chunk {i+1} (Length: {len(chunk)}) ---")
    print(chunk)
    print("-" * 20 + "\n")
Compare the output of the RecursiveCharacterTextSplitter with the CharacterTextSplitter. You'll likely observe that the recursive method does a better job of preserving paragraph and sentence structure within the chunks, even with the same chunk_size target. It prioritizes splitting along \n\n (paragraphs), then \n (lines), then spaces, which usually leads to more semantically coherent chunks.
Which chunking strategy is best?
- Fixed-size (CharacterTextSplitter): Simple, predictable chunk lengths. Can be less effective at preserving semantic meaning if it splits mid-sentence frequently.
- Recursive (RecursiveCharacterTextSplitter): Generally preferred. Tries to maintain semantic boundaries (paragraphs, sentences), leading to more coherent chunks, though chunk sizes may vary slightly more.

The ideal chunk_size and chunk_overlap depend heavily on your documents' structure, the embedding model you use, and your retrieval needs.
Experimentation is often needed. Try different splitters and parameters, then evaluate how well the resulting chunks perform in your RAG pipeline during retrieval (which we'll cover soon, and evaluate in Chapter 6).
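One lightweight way to compare settings during experimentation is to summarize the chunk lengths each configuration produces. chunk_stats below is a hypothetical helper written for this illustration, not a LangChain API:

```python
def chunk_stats(chunks):
    """Summarize chunk lengths so different splitter settings can be compared."""
    lengths = [len(c) for c in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(sum(lengths) / len(lengths), 1),
    }

# With the splitters above you might compare, e.g.:
#   print(chunk_stats(fixed_chunks))
#   print(chunk_stats(recursive_chunks))
print(chunk_stats(["short chunk", "a somewhat longer chunk of text"]))
```

Length statistics are only a proxy, of course; the real test is retrieval quality, which is why evaluating chunks inside the full RAG pipeline matters.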
This practical exercise demonstrates how to transform raw text into processed chunks. The next step in preparing data involves generating vector embeddings for these chunks and storing them in a vector database, making them searchable by the retrieval component.
© 2025 ApX Machine Learning