An AI agent's immediate recall, its ability to remember what just happened or what its current specific instruction is, relies heavily on the Large Language Model's (LLM) context window. This window acts as the agent's short-term or working memory, holding the information the LLM can directly access for generating its next response or action. Effective prompt engineering is fundamental for managing this finite resource, ensuring the agent maintains coherence and focuses on the relevant details during multi-step operations.
Think of the LLM's context window like the RAM in a computer: it is fast, directly accessible memory, but it has a limited capacity. Every piece of information provided to an LLM in a single interaction must fit within this window: the system instructions defining the agent's role, the user's current query, a transcript of recent conversation turns, descriptions of available tools, and any intermediate thoughts or "scratchpad" notes.
The size of the context window is measured in tokens, which can roughly be thought of as words or parts of words. Models come with different context window sizes, for example, 4,096 tokens, 8,192 tokens, 32,768 tokens, or even larger. While larger windows offer more space, they are always finite. If the information stream exceeds this limit, older information is typically pushed out, leading to the agent "forgetting" earlier parts of the interaction. This can disrupt task continuity and lead to errors or irrelevant responses.
For instance, if an agent's context window is 4,096 tokens, then the system prompt, conversation history, tool descriptions, current query, and the space reserved for the model's response must all share that allowance. Managing this budget effectively through prompt design is a core skill in building capable agents.
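As a concrete illustration, the sketch below checks an assembled prompt against a token budget. It assumes the tiktoken tokenizer library; the 4,096-token limit and the 512-token response reserve are illustrative values, not properties of any particular model.

```python
import tiktoken

CONTEXT_WINDOW = 4096   # illustrative model limit
RESPONSE_RESERVE = 512  # tokens held back for the model's reply

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Count tokens the way the model's tokenizer would."""
    return len(enc.encode(text))

def fits_in_budget(*prompt_parts: str) -> bool:
    """Check whether the assembled prompt leaves room for a response."""
    used = sum(count_tokens(part) for part in prompt_parts)
    return used <= CONTEXT_WINDOW - RESPONSE_RESERVE

system = "You are a meticulous research assistant."
history = "...recent conversation turns..."
query = "Identify the top three challenges in solar panel adoption."
print(fits_in_budget(system, history, query))  # False means something must be trimmed
```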
To make the most of an agent's short-term memory, you can employ several prompt engineering strategies. These techniques help ensure that the most relevant information stays within the LLM's attention span.
LLMs don't always weigh all parts of the prompt equally. Often, information at the very beginning (primacy effect) or the very end (recency effect) of the prompt has a stronger influence on the output.
System: You are a meticulous research assistant. Your goal is to gather information about renewable energy sources. Always cite your sources. Do not provide opinions, only factual data.
User: ... [previous conversation history summarized] ...
User: Now, based on the findings so far, identify the top three challenges in solar panel adoption in urban environments.
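One way to encode this heuristic in application code is to assemble the prompt so that durable instructions occupy the primacy position and the immediate task is restated at the end. The function below is an illustrative sketch, not a prescribed API:

```python
def build_prompt(instructions: str, history: str, task: str) -> str:
    """Assemble a prompt that exploits primacy and recency effects."""
    return "\n\n".join([
        # Primacy position: durable role and rules go first.
        f"System: {instructions}",
        # Middle: bulky, less position-sensitive material.
        f"Conversation so far:\n{history}",
        # Recency position: state the immediate task last.
        f"User: {task}",
    ])

prompt = build_prompt(
    "You are a meticulous research assistant. Always cite your sources.",
    "[previous conversation history summarized]",
    "Identify the top three challenges in solar panel adoption in urban environments.",
)
```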
For an agent to hold a coherent conversation or execute a sequence of related tasks, it needs to remember what was said or done previously.
<system_prompt>
You are a helpful assistant.
</system_prompt>
<user_prompt>
<conversation_history_summary>
The user is planning a 7-day trip to Paris for 2 adults in spring. Budget is moderate. Interests include museums and historical sites.
</conversation_history_summary>
<recent_exchanges>
User: What about evening entertainment options?
Agent: For evening entertainment, Paris offers classical concerts, theatre shows, and Seine river cruises. Do any of these sound appealing?
User: The river cruise sounds nice. Can you find options that include dinner?
</recent_exchanges>
Current question: Find dinner cruise options on the Seine for two adults.
</user_prompt>
In this example, conversation_history_summary is a condensed version of earlier turns, while recent_exchanges keeps the last few interactions verbatim.
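A minimal sketch of this sliding-window-plus-summary pattern follows. Here, summarize is a hypothetical placeholder standing in for an LLM call that condenses older turns, and the tag names mirror the example above:

```python
RECENT_TURNS = 4  # how many exchanges to keep verbatim (illustrative)

def summarize(turns: list[str]) -> str:
    # Hypothetical placeholder: in practice, ask the LLM to condense
    # these turns into a few sentences of key facts and decisions.
    return "The user is planning a 7-day trip to Paris for 2 adults in spring."

def build_history_block(turns: list[str]) -> str:
    """Summarize older turns, keep the most recent ones verbatim."""
    older, recent = turns[:-RECENT_TURNS], turns[-RECENT_TURNS:]
    parts = []
    if older:
        parts.append(
            "<conversation_history_summary>\n"
            f"{summarize(older)}\n"
            "</conversation_history_summary>"
        )
    parts.append(
        "<recent_exchanges>\n" + "\n".join(recent) + "\n</recent_exchanges>"
    )
    return "\n".join(parts)
```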
You can instruct an agent to maintain a "scratchpad" or "working_notes" section directly within its prompt. This section serves as a mutable space where the agent can jot down intermediate thoughts, calculations, observations, or refined plans.
The agent is prompted to read from and update this scratchpad as part of its reasoning process. After the agent generates its response (which includes the updated scratchpad), your controlling application code extracts this scratchpad and includes it in the prompt for the next turn.
Benefits: the scratchpad makes the agent's intermediate reasoning explicit, carries important details across turns even as older conversation history is trimmed, and gives your application an auditable record of the agent's thought process.
Example Prompt Structure:
System: You are an AI assistant tasked with solving a multi-step problem. Use the <scratchpad> to think step-by-step and record important information.
User:
<problem_description>
[Detailed problem statement here]
</problem_description>
<scratchpad>
[Previous scratchpad content, or "Think step-by-step here..." for the first turn]
</scratchpad>
Based on the problem and your scratchpad, what is the next step or your final answer? Update your scratchpad.
The agent's output would include its reasoning and the updated content for <scratchpad>, which is then fed back into the next prompt.
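A sketch of that feedback loop in application code is shown below. call_llm is a hypothetical stand-in for your model client, and the regular expression simply pulls the updated <scratchpad> out of the response:

```python
import re

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real model call.
    return "Step 2 looks correct. <scratchpad>Totals so far: 42</scratchpad>"

def run_turn(problem: str, scratchpad: str) -> tuple[str, str]:
    """Run one turn and carry the updated scratchpad forward."""
    prompt = (
        "System: You are an AI assistant tasked with solving a multi-step "
        "problem. Use the <scratchpad> to think step-by-step.\n"
        f"<problem_description>\n{problem}\n</problem_description>\n"
        f"<scratchpad>\n{scratchpad}\n</scratchpad>\n"
        "What is the next step or your final answer? Update your scratchpad."
    )
    response = call_llm(prompt)
    # Extract the updated scratchpad so it can seed the next prompt.
    match = re.search(r"<scratchpad>(.*?)</scratchpad>", response, re.DOTALL)
    new_scratchpad = match.group(1).strip() if match else scratchpad
    return response, new_scratchpad

scratchpad = "Think step-by-step here..."
for _ in range(3):
    response, scratchpad = run_turn("Compute the project totals.", scratchpad)
```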
Using clear delimiters, such as XML-like tags (e.g., <history>, <user_query>, <available_tools>) or Markdown headings, helps the LLM differentiate between various types of information within a complex prompt. This can improve the model's ability to parse the prompt accurately and focus on the sections most relevant to its current sub-task.
### Instructions
You are a helpful assistant.
### Available Tools
- search_web(query: string)
- get_weather(city: string)
### Conversation History
User: What's the weather like in London?
Agent: The weather in London is currently 15°C and cloudy.
User: Thanks! Now, can you search for popular tourist attractions there?
### Current Task
Based on the conversation, use the appropriate tool to find popular tourist attractions in London.
This structured approach aids the LLM in understanding the different components of its input and their roles.
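One simple way to apply this in code is a helper that renders named sections under consistent headings. The function below is an illustrative sketch, not a prescribed format:

```python
def build_sectioned_prompt(sections: dict[str, str]) -> str:
    """Render named sections under Markdown headings so the model
    can tell each component of the prompt apart."""
    return "\n\n".join(
        f"### {title}\n{body}" for title, body in sections.items()
    )

prompt = build_sectioned_prompt({
    "Instructions": "You are a helpful assistant.",
    "Available Tools": "- search_web(query: string)\n"
                       "- get_weather(city: string)",
    "Conversation History": "User: What's the weather like in London?\n"
                            "Agent: The weather in London is currently 15°C and cloudy.",
    "Current Task": "Use the appropriate tool to find popular tourist "
                    "attractions in London.",
})
print(prompt)
```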
While the above strategies manage information already introduced into the interaction, sometimes the agent needs specific pieces of information that it hasn't encountered yet or that were part of a much earlier, now summarized, interaction. Your application logic can proactively inject relevant context snippets into the prompt. For example, if the agent mentions a specific entity, the system can retrieve a brief definition or key facts about that entity from a knowledge base and add it to the prompt for the current turn, enriching the agent's short-term memory.
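A sketch of this injection step follows, using a plain dictionary as a stand-in for a real knowledge base or retrieval index; the entity entries are illustrative:

```python
# Stand-in knowledge base; in practice this might be a database,
# vector store, or search index.
KNOWLEDGE_BASE = {
    "Seine": "The Seine is the river flowing through central Paris; "
             "dinner cruises depart from several quays.",
    "Louvre": "The Louvre is the world's largest art museum, located in Paris.",
}

def inject_context(user_message: str) -> str:
    """Prepend brief facts about any known entity the user mentions."""
    facts = [
        fact for entity, fact in KNOWLEDGE_BASE.items()
        if entity.lower() in user_message.lower()
    ]
    if not facts:
        return user_message
    return (
        "<background_facts>\n" + "\n".join(facts) + "\n</background_facts>\n"
        + user_message
    )

print(inject_context("Find dinner cruise options on the Seine."))
```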
The following diagram illustrates how various pieces of information constitute the prompt, which in turn populates the LLM's context window for a given processing turn.
The diagram shows that the prompt, containing system instructions, user requests, history, tool data, and working notes, fills the LLM's context window. This window then informs the LLM's processing unit to generate an output.
The primary goal of these prompt strategies is to use the limited context window efficiently and prevent critical information from being lost. When the volume of information (system prompts, history, tool descriptions, current query) exceeds the LLM's token limit, the model will inevitably "forget" some of it, usually the oldest information in a simple truncation scenario. This loss can cause the agent to lose track of its goals, repeat previous actions, or fail to use relevant past information.
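A minimal sketch of that simple truncation behavior, using a rough characters-per-token approximation in place of a real tokenizer:

```python
def count_tokens(text: str) -> int:
    # Rough approximation: about 4 characters per token.
    return max(1, len(text) // 4)

def truncate_history(turns: list[str], budget: int) -> list[str]:
    """Drop the oldest turns until the history fits the token budget."""
    kept = list(turns)
    while kept and sum(count_tokens(t) for t in kept) > budget:
        kept.pop(0)  # the oldest turn is "forgotten" first
    return kept

history = [f"Turn {i}: ..." for i in range(1, 50)]
print(len(truncate_history(history, budget=50)))  # only the newest turns remain
```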
By actively managing what goes into the prompt each turn (summarizing, selecting, and structuring), you reduce the likelihood of such context overflow. These techniques are your first line of defense for maintaining coherent, stateful agent behavior over several turns of interaction.
While these methods are powerful for managing short-term memory, they are not a complete solution for long-term knowledge retention or recalling information from vast datasets. For those requirements, strategies involving external knowledge bases and retrieval mechanisms, discussed later in this chapter, become essential. However, a well-managed context window is the foundation upon which all other memory techniques build.