Storing the entire conversation history with ConversationBufferMemory is the most straightforward approach. However, it carries a significant risk: as a conversation grows, so does the history object, and eventually the accumulated context can exceed the language model's token limit, causing an API error and a failed request. ConversationSummaryMemory addresses the token-limit problem by summarizing past interactions, but it introduces trade-offs of its own, namely the latency and cost of an additional language model call for each summarization.
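For reference, both of these are ordinary memory classes from langchain.memory. The following is a minimal sketch of how each is constructed; the model name is only an example.
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory, ConversationSummaryMemory

llm = ChatOpenAI(model="gpt-4o-mini")

# Keeps every message verbatim; the history grows without bound.
full_memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Compresses older turns into a running summary; bounded size,
# but each update costs an extra LLM call for the summarization step.
summary_memory = ConversationSummaryMemory(llm=llm, memory_key="chat_history", return_messages=True)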
Between these two extremes lies a set of more pragmatic strategies that manage a fixed-size history. These methods offer a balance between context retention and resource management by keeping only the most recent parts of the conversation. They are fast, efficient, and prevent your application from failing due to an oversized context.
The simplest fixed-size strategy is to remember a set number of past interactions. This is handled by ConversationBufferWindowMemory. It works just like ConversationBufferMemory, but it only keeps the last k conversational turns in its history. A "turn" consists of the human's input and the AI's immediate response.
This approach is computationally inexpensive and provides a predictable ceiling on your context size. It is particularly effective for applications where the most recent exchanges are the most relevant, such as task-oriented bots where context from ten messages ago is unlikely to be useful.
Let's configure a chain with this memory type, setting k=2 to store only the last two interactions.
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferWindowMemory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
# Initialize the LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
# Define the prompt structure
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant for a travel agency."),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}")
])

# Initialize memory with k=2
# This will store the last 2 Human/AI message pairs
memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    return_messages=True,
    k=2
)

# Create the conversation chain
conversation = ConversationChain(
    llm=llm,
    prompt=prompt,
    memory=memory
)
# --- Interaction 1 ---
print(conversation.invoke({"input": "Hi, I'm planning a trip. My name is George."})["response"])
# AI: Hello George! Where are you thinking of traveling?

# --- Interaction 2 ---
print(conversation.invoke({"input": "I'd like to go to Paris."})["response"])
# AI: Paris is a wonderful choice! Are you looking for recommendations on flights or hotels?

# --- Interaction 3 ---
print(conversation.invoke({"input": "I'm interested in flights."})["response"])
# AI: Great, I can help with that. For what dates are you looking to book flights to Paris?

# --- Interaction 4 ---
# The first interaction ("My name is George") is now outside the window of k=2
# The agent will have forgotten the user's name.
print(conversation.invoke({"input": "Do you remember my name?"})["response"])
# AI: I apologize, but I don't have your name stored in our conversation history. Could you please remind me?
In the final exchange, the model has no memory of the user's name. The first turn, where "George" was mentioned, fell outside the window once the third turn was recorded, so by the fourth request the buffer contained only the exchanges about Paris and flights. This demonstrates the primary trade-off: ConversationBufferWindowMemory is efficient but discards older context, regardless of its importance.
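You can verify this pruning by driving the memory object directly, without running a chain. A minimal sketch with a fresh memory instance:
from langchain.memory import ConversationBufferWindowMemory

window = ConversationBufferWindowMemory(k=2, memory_key="chat_history", return_messages=True)

# Record three turns manually.
window.save_context({"input": "Hi, my name is George."}, {"output": "Hello George!"})
window.save_context({"input": "I'd like to go to Paris."}, {"output": "Great choice!"})
window.save_context({"input": "I'm interested in flights."}, {"output": "Which dates?"})

# Only the last k=2 turns (four messages) are returned; the "George" turn is gone.
print(window.load_memory_variables({})["chat_history"])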
Counting interactions with k is a good proxy for managing context size, but it is not precise. Some conversational turns may be very short ("Okay"), while others can be long paragraphs. A more accurate method for avoiding API limits is to manage the history based on the number of tokens.
ConversationTokenBufferMemory allows for this fine-grained control. Instead of a k value, you specify a max_token_limit. The memory will retain the most recent messages that fit within this token budget. As new messages are added, it will prune the oldest messages from the beginning of the conversation until the total token count is back under the limit.
A notable implementation detail is that this memory type requires access to the LLM instance to correctly calculate the token count of the messages, as tokenization schemes can vary between models.
from langchain.memory import ConversationTokenBufferMemory
# Note that we pass the llm object to the memory
# This is required for it to count tokens accurately.
token_memory = ConversationTokenBufferMemory(
    llm=llm,
    memory_key="chat_history",
    return_messages=True,
    max_token_limit=200  # Set a token limit for the history
)

# The chain is created the same way
token_conversation = ConversationChain(
    llm=llm,
    prompt=prompt,
    memory=token_memory
)
# --- Interaction 1 (short) ---
print(token_conversation.invoke({
    "input": "Hi, I'm Sarah, and I want to book a trip to Tokyo."
})["response"])

# --- Interaction 2 (long) ---
print(token_conversation.invoke({
    "input": "Can you give me a detailed list of recommended activities? "
             "I'm interested in historical sites, modern architecture, "
             "local food markets, and maybe a technology museum. "
             "Please provide some descriptions for each place."
})["response"])

# --- Interaction 3 (short) ---
# After the long response from the AI for the previous turn,
# the total token count will be high. The oldest messages
# about the user's name might be pruned to stay under the limit.
print(token_conversation.invoke({
    "input": "What city was I interested in?"
})["response"])
# AI: You were interested in visiting Tokyo!

print(token_conversation.invoke({
    "input": "And what was my name?"
})["response"])
# AI might respond: I'm sorry, I don't recall you mentioning your name.
# Or it might remember, depending on the exact token counts.
In this scenario, the agent remembers "Tokyo" because it was part of a more recent, substantial exchange. However, the very first message containing "Sarah" might be dropped if the combined length of the second interaction and its response exceeds the max_token_limit. This method provides protection against token overflow errors while retaining the most recent context.
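Continuing the example above, you can inspect what the token-bounded memory currently holds and estimate its size with the chat model's get_num_tokens_from_messages method. A quick check, not part of the chain itself:
# Inspect the retained history and its approximate token footprint.
history = token_memory.load_memory_variables({})["chat_history"]
print(f"Messages retained: {len(history)}")
print(f"Approximate tokens: {llm.get_num_tokens_from_messages(history)}")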
The differences between these fixed-size memory types and the basic buffer can be visualized by observing which parts of a conversation they retain over time.
The diagram shows how different memory types prune a conversation history.
BufferMemory keeps all six messages. BufferWindowMemory with k=2 keeps only the last four messages (two human/AI turns). TokenBufferMemory also prunes the oldest messages to stay within its budget, which in this case also results in keeping the last four.
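The comparison can be reproduced with a small script. Exactly where the token buffer cuts the history depends on the model's tokenizer and the limit you choose, so the limit below is illustrative:
from langchain_openai import ChatOpenAI
from langchain.memory import (
    ConversationBufferMemory,
    ConversationBufferWindowMemory,
    ConversationTokenBufferMemory,
)

llm = ChatOpenAI(model="gpt-4o-mini")  # needed by the token buffer for counting

memories = {
    "BufferMemory": ConversationBufferMemory(return_messages=True),
    "BufferWindowMemory (k=2)": ConversationBufferWindowMemory(k=2, return_messages=True),
    "TokenBufferMemory (limit=60)": ConversationTokenBufferMemory(llm=llm, max_token_limit=60, return_messages=True),
}

# The same six-message conversation (three human/AI turns) for each memory type.
turns = [
    ("Hi, I'm planning a trip.", "Great, where would you like to go?"),
    ("Somewhere warm, maybe Lisbon.", "Lisbon is lovely in spring."),
    ("What should I pack?", "Light clothing and comfortable shoes."),
]

for name, memory in memories.items():
    for human, ai in turns:
        memory.save_context({"input": human}, {"output": ai})
    retained = memory.load_memory_variables({})["history"]
    print(f"{name}: retains {len(retained)} of 6 messages")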
The right memory component depends entirely on your application's needs: ConversationBufferMemory when conversations are short and every detail matters, ConversationBufferWindowMemory when only the most recent turns are relevant and you want predictable cost, ConversationTokenBufferMemory when you need a hard guarantee that the history stays within a token budget, and ConversationSummaryMemory when long-range context is worth the extra latency and cost of summarization.
By selecting the appropriate memory strategy, you can build conversational applications that are both effective and efficient, providing a coherent user experience without exhausting your token budget.