Large Language Models (LLMs), despite their power, have a finite attention span, often referred to as the context window. This window represents the maximum amount of text (measured in tokens, roughly corresponding to words or parts of words) that the model can consider at once when processing input and generating output. Think of it as the model's short-term memory.
In a standard LLM interaction, this limit applies to the user's prompt and the model's generated response. However, in a RAG system, we intentionally add more information into the prompt: the retrieved context chunks from our knowledge source. This immediately presents a challenge: the combined length of the original user query plus the retrieved text passages can easily exceed the LLM's context window capacity.
Imagine trying to stuff too many documents into a small filing cabinet. You either have to leave some documents out, cut some in half, or find a way to summarize them. Similarly, when the augmented prompt (query + retrieved chunks) is too long for the LLM, we need strategies to manage the overflow. If we simply send a prompt that exceeds the limit, the model might truncate the input arbitrarily, ignore the overflowing parts, or return an error, leading to incomplete or inaccurate responses.
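As a concrete illustration, the sketch below estimates the token footprint of a query plus its retrieved chunks with the tiktoken library and checks it against a budget. The 8,192-token limit, the reserved response budget, and the cl100k_base encoding are illustrative assumptions, not properties of any particular model.

```python
# A minimal sketch: estimating whether query + retrieved chunks fit a context window.
# The 8,192-token limit, response budget, and cl100k_base encoding are assumptions.
import tiktoken

CONTEXT_LIMIT = 8192      # assumed model context window, in tokens
RESPONSE_BUDGET = 1024    # tokens reserved for the model's answer

encoder = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Return the number of tokens in a piece of text."""
    return len(encoder.encode(text))

def fits_in_window(query: str, chunks: list[str]) -> bool:
    """Check whether the query plus all retrieved chunks fit the prompt budget."""
    prompt_budget = CONTEXT_LIMIT - RESPONSE_BUDGET
    total = count_tokens(query) + sum(count_tokens(c) for c in chunks)
    return total <= prompt_budget
```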
Here are common strategies for managing context length limitations in RAG:
The most straightforward approach is simple truncation. If the combined text exceeds the limit, you cut off the least relevant parts. This often means dropping the chunks ranked lowest by the retriever (assuming the first few chunks it returns are the most relevant) or trimming the end of the last chunk that is included.
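A rough sketch of this idea, assuming the chunks arrive sorted by retrieval score and reusing a tiktoken-based counter like the one above; the budget value is again an assumption.

```python
# A minimal sketch of simple truncation: keep chunks in retrieval order until the
# token budget is exhausted, optionally trimming the last chunk to fit.
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")
PROMPT_BUDGET = 7168  # assumed tokens available for the query plus retrieved context

def truncate_chunks(query: str, chunks: list[str]) -> list[str]:
    remaining = PROMPT_BUDGET - len(encoder.encode(query))
    kept = []
    for chunk in chunks:                      # chunks assumed sorted by relevance
        tokens = encoder.encode(chunk)
        if len(tokens) <= remaining:
            kept.append(chunk)
            remaining -= len(tokens)
        else:
            # Trim the final chunk to whatever space is left, then stop.
            if remaining > 0:
                kept.append(encoder.decode(tokens[:remaining]))
            break
    return kept
```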
Instead of just taking the top N chunks based on initial retrieval scores, you might employ a more sophisticated selection process. This could involve:
Stricter Limits: Only include the absolute top K chunks, even if more were retrieved, ensuring K is small enough to fit.
Re-ranking: Use a secondary, potentially more computationally intensive model (a "re-ranker") to re-evaluate the relevance of the initially retrieved top N chunks specifically in the context of the query. Then, select the highest re-ranked chunks that fit within the window (a code sketch of this step follows the pros and cons below).
Query-Focused Selection: Prioritize chunks that have the highest semantic similarity specifically to the user's query, rather than just general relevance.
Pros: Aims to keep the most relevant information, reducing the chance of cutting off important context compared to simple truncation.
Cons: Can add complexity and latency (especially with re-ranking). Still involves discarding potentially useful information.
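As a sketch of the re-ranking idea, the snippet below uses a cross-encoder from the sentence-transformers library to re-score retrieved chunks against the query and keeps only as many of the top-scoring ones as fit the budget. The specific model name and the budget value are assumptions chosen for illustration.

```python
# A minimal re-ranking sketch: re-score retrieved chunks against the query with a
# cross-encoder, then keep the highest-scoring chunks that fit the token budget.
# The model name and budget below are illustrative assumptions.
import tiktoken
from sentence_transformers import CrossEncoder

encoder = tiktoken.get_encoding("cl100k_base")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
CONTEXT_BUDGET = 6000  # assumed tokens available for retrieved context

def rerank_and_select(query: str, chunks: list[str]) -> list[str]:
    # Score each (query, chunk) pair, then sort chunks from most to least relevant.
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = [c for _, c in sorted(zip(scores, chunks), reverse=True)]
    selected, used = [], 0
    for chunk in ranked:
        length = len(encoder.encode(chunk))
        if used + length > CONTEXT_BUDGET:
            continue          # skip chunks that would overflow the budget
        selected.append(chunk)
        used += length
    return selected
```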
Another approach is to summarize the retrieved chunks before injecting them into the prompt. Instead of inserting several full text passages, you could use another LLM call (or a specialized summarization model) to create a concise summary of the retrieved information. This summary, along with the original query, is then fed to the main generator LLM.
A workflow showing how retrieved chunks can be summarized before being passed to the generator LLM to manage context length.
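A sketch of this pattern using the OpenAI Python client is shown below. The model name, the prompt wording, and the choice to summarize all chunks in a single call are assumptions made for illustration; a dedicated summarization model could be substituted.

```python
# A minimal summarization sketch: compress retrieved chunks into a short summary
# before building the final prompt. Model name and prompt wording are assumptions.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def summarize_chunks(query: str, chunks: list[str], max_words: int = 300) -> str:
    """Ask an LLM to condense the retrieved chunks, keeping only query-relevant facts."""
    joined = "\n\n".join(chunks)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[
            {"role": "system",
             "content": "Summarize the provided passages, keeping only information "
                        f"relevant to the user's question, in under {max_words} words."},
            {"role": "user",
             "content": f"Question: {query}\n\nPassages:\n{joined}"},
        ],
    )
    return response.choices[0].message.content

# The returned summary then replaces the raw chunks in the prompt sent to the generator LLM.
```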
During the data preparation phase (Chapter 3), the way documents are chunked plays a role here. Using smaller chunk sizes means each individual piece of context is smaller, potentially allowing more distinct pieces to fit within the window. However, smaller chunks might lack sufficient context on their own. Finding the right chunk size often involves experimentation based on the LLM's context limit and the nature of the source documents.
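For illustration, a simple token-based splitter like the one below makes the trade-off explicit; the chunk_size and overlap values are starting points to experiment with, not recommendations.

```python
# A minimal token-based chunking sketch. The chunk_size and overlap defaults are
# illustrative starting points to experiment with, not recommendations.
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size tokens."""
    tokens = encoder.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(encoder.decode(window))
    return chunks
```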
Choosing the right strategy depends on several factors, including the size of the LLM's context window, acceptable latency and cost, and the nature of your source documents and queries.
Often, a combination of approaches might be used. For example, retrieving slightly more chunks than needed, re-ranking them, and then truncating the least relevant ones from the re-ranked list to fit the context window.
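One possible shape of such a combined pipeline is sketched below. The overlap-based scorer is a toy stand-in for a real re-ranker, and the word-count budget is a rough proxy for a token budget; both are assumptions for illustration.

```python
# A minimal end-to-end sketch of the combined strategy: over-retrieve, re-rank,
# then drop the lowest-ranked chunks to fit. The overlap-based scorer is a toy
# stand-in for a real re-ranker; the word budget is a rough proxy for tokens.
def score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words that appear in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)

def build_context(query: str, retrieved: list[str], word_budget: int = 1500) -> list[str]:
    # 1. Re-rank the over-retrieved chunks by (toy) relevance to the query.
    ranked = sorted(retrieved, key=lambda c: score(query, c), reverse=True)
    # 2. Keep the top-ranked chunks until the budget is spent; discard the rest.
    selected, used = [], 0
    for chunk in ranked:
        length = len(chunk.split())
        if used + length > word_budget:
            break
        selected.append(chunk)
        used += length
    return selected
```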
Managing context length is a practical engineering challenge in building effective RAG systems. It requires balancing the desire to provide the LLM with comprehensive context against the hard limits of the model's architecture. Experimentation and evaluation (which we'll cover later) are essential to finding the optimal approach for your specific use case.