Imagine you're having a conversation. You naturally remember what was said recently, but details from much earlier might fade. Large Language Models have something similar: a limited working memory called the context window.
The context window is the amount of text the model can "see" or "remember" when generating its next response. This includes both your most recent prompt and the preceding parts of the conversation (or the document you provided). Think of it as the model's short-term memory. Everything you input and everything the model outputs contributes to filling this window.
Remember tokens from Chapter 1? Tokens are the basic units of text that LLMs process, often representing parts of words, whole words, or punctuation. The size of the context window is typically measured in the number of tokens it can hold. A model might have a context window of 2048 tokens, 4096 tokens, or something much larger, such as 32,000 tokens or more. A larger number means the model can consider more text at once.
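If you want a feel for how text breaks into tokens, you can experiment with a tokenizer library. The snippet below is a minimal sketch using the `tiktoken` package; local models ship with their own tokenizers, so the exact counts you get will differ from model to model, and the sample sentence here is just illustrative.

```python
# A minimal sketch of counting tokens with the tiktoken library.
# Different models use different tokenizers, so treat the count as approximate.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # a common general-purpose encoding

text = "Local LLMs keep your data on your own machine."
tokens = encoding.encode(text)

print(f"Characters: {len(text)}")
print(f"Tokens:     {len(tokens)}")
# A rough rule of thumb for English text is about 3-4 characters per token.
```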
The size of the context window directly affects how well the model can handle tasks such as summarizing long documents, answering questions about material you paste in, and staying coherent across extended conversations.
Think of the context window like a fixed-size sliding window moving over your conversation history. As new text (your prompt or the model's response) is added, the window slides forward. If the total text exceeds the window size, the oldest text falls out of view.
Let's illustrate this with a simplified example. Imagine a model with a very small context window, say, enough for only 15 words.
Interaction:

- Turn 1 (You): "My cat is named Pixel."
- Turn 2 (You): "What exactly is a context window?" (the model replies with a short explanation)
- Turn 3 (You): "What is my pet's name?"
- Turn 3 (Model): "I'm sorry, you haven't mentioned a pet in this conversation."
In this simplified case, because "cat" fell out of the small context window, the model couldn't recall it. Real LLMs have much larger windows, but the principle is the same.
A simplified representation showing how, at Turn 3, the information from Turn 1 might fall outside the limited context window, leading the model to only recall information from Turn 2 and Turn 3.
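To see the same trimming logic in code, here is a small Python sketch. It counts words instead of real tokens to keep things simple; actual tools count tokens with the model's tokenizer, but the idea of dropping the oldest messages once the window fills up is the same, and the messages below are just the toy example from above.

```python
# A simplified sketch of context-window trimming.
# Words stand in for tokens here; real tools count tokens with the model's tokenizer.
CONTEXT_WINDOW = 15  # maximum "tokens" (words) the model can see at once


def trim_to_window(messages, window=CONTEXT_WINDOW):
    """Keep only the most recent messages that still fit inside the window."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        length = len(message.split())
        if used + length > window:
            break  # the next-oldest message no longer fits
        kept.append(message)
        used += length
    return list(reversed(kept))  # restore chronological order


history = [
    "My cat is named Pixel.",                         # oldest (Turn 1)
    "What exactly is a context window?",              # Turn 2
    "Can you explain tokens again in simple terms?",  # newest (Turn 3)
]
print(trim_to_window(history))
# The Turn 1 message about the cat no longer fits, so the model never sees it.
```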
Different models come with different context window sizes. When you were learning about finding models (Chapter 3), you might have seen specifications like "4k context" or "32k context". This refers to the approximate number of tokens (4096 or 32768, respectively) the model supports. Larger context windows generally require more computational resources (especially RAM and VRAM, as discussed in Chapter 2).
For basic chats, even a moderately sized context window (e.g., 4096 tokens) is often sufficient. However, if you plan to work with very long documents or have extended, complex dialogues, choosing a model with a larger context window becomes more important.
You generally don't need to manually count tokens. The tools you use (like Ollama or LM Studio, covered in Chapter 4) typically manage the context window automatically, trimming the oldest parts of the conversation as needed. However, being aware of this limit helps you understand why a model might sometimes seem to "forget" things you mentioned earlier in a long chat. If you notice this happening, you might need to remind the model by re-stating significant information in your prompt.
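That said, some tools do let you adjust the window yourself. For example, Ollama exposes a `num_ctx` option that sets the context length for a request; the sketch below assumes Ollama is running locally with a model already pulled, and the model name and window size are only illustrative.

```python
# A sketch of requesting a larger context window from Ollama's local API.
# Assumes Ollama is running locally and the "llama3" model has been pulled.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Summarize the document I pasted above.",
        "stream": False,
        "options": {"num_ctx": 8192},  # ask for an 8k-token context window
    },
)
print(response.json()["response"])
```

Keep in mind that raising `num_ctx` increases memory use, in line with the hardware considerations from Chapter 2.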
Understanding the context window is fundamental to effective prompting. It helps explain the model's conversational memory limits and guides you in structuring interactions, especially as they become longer or more complex.