The most direct way to give a language model a sense of history is to simply show it the most recent parts of the conversation with every new request. This technique, known as conversation buffer memory, stores a verbatim transcript of the interaction. When you need to generate a new response, you provide the LLM with this running history, allowing it to understand the context of the latest user input.
The memory module provides the ConversationBuffer class to manage this process automatically. It acts as a sliding window over your conversation, keeping track of messages and ensuring the history doesn't grow indefinitely.
You create a buffer by instantiating the ConversationBuffer class. Its constructor accepts several parameters to control its behavior, the most important of which is max_messages. This parameter sets a limit on how many messages the buffer will retain, preventing the conversation history from consuming too much memory or exceeding an LLM's context window.
from kerb.memory import ConversationBuffer
# Create a buffer that stores up to 100 messages
buffer = ConversationBuffer(max_messages=100)
In this example, the buffer is configured to hold the last 100 messages. Once the 101st message is added, the oldest message (the very first one) is automatically removed to make space. This "first-in, first-out" behavior is fundamental to how the buffer manages its size.
When a conversation buffer reaches its max_messages limit, the oldest message is pruned to make room for the newest one, maintaining a fixed-size history.
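The short sketch below makes this eviction visible on a deliberately small buffer. It relies on the add_message method and messages attribute used later in this section, so treat it as an illustration of the first-in, first-out behavior rather than a full API reference.
from kerb.memory import ConversationBuffer

# A tiny buffer so the eviction is easy to observe
small_buffer = ConversationBuffer(max_messages=3)

# Add five messages; the first two exceed the limit and are evicted
for i in range(5):
    small_buffer.add_message("user", f"Message {i}")

# Only "Message 2", "Message 3", and "Message 4" remain
for msg in small_buffer.messages:
    print(msg.content)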
To build the conversation history, you use the add_message method. This method requires a role (typically "user" or "assistant") and the content of the message. Many applications also start a conversation with a "system" message to define the LLM's persona or instructions.
Let's simulate a short conversation with a Python programming assistant.
# System message to set the context
buffer.add_message(
    "system",
    "You are a helpful AI assistant specialized in Python programming."
)

# First user query
buffer.add_message(
    "user",
    "Can you explain async/await in Python?"
)

# Assistant's response
buffer.add_message(
    "assistant",
    "async/await in Python allows you to write asynchronous code. The 'async def' "
    "keyword defines a coroutine, and 'await' pauses execution until the awaited "
    "operation completes."
)

# Follow-up user query
buffer.add_message(
    "user",
    "What libraries would you recommend for async web scraping?"
)
print(f"Total messages in buffer: {len(buffer.messages)}")
After these calls, the buffer contains four Message objects, each storing the role, content, and a timestamp. This history is now ready to be used as context for the next LLM call.
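If you want to see what one of these objects holds, you can inspect it directly. The role and content attributes also appear later in this section; the timestamp attribute name below is an assumption based on the description above and may differ in your version of the library.
# Inspect the most recently added message
last_message = buffer.messages[-1]
print(last_message.role)       # "user"
print(last_message.content)    # "What libraries would you recommend for async web scraping?"
print(last_message.timestamp)  # creation time of the message (attribute name assumed)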
When the user sends a new message, you need to provide the LLM with the recent conversation history so it can generate a relevant response. ConversationBuffer offers two primary methods for this: get_recent_messages and get_context.
The get_recent_messages method retrieves a list of the most recent Message objects from the buffer. This is useful when you need to construct the prompt payload for an API call yourself.
# Get the last 3 messages from the buffer
recent_messages = buffer.get_recent_messages(count=3)
for msg in recent_messages:
    print(f"{msg.role}: {msg.content[:70]}...")
This prints the three most recent messages: the original user question, the assistant's explanation, and the follow-up query. The system message falls outside the requested count of three.
For greater convenience, the get_context method returns a single formatted string containing the recent conversation history. This string is designed to be directly inserted into an LLM prompt.
# Get the recent history as a formatted string
context_string = buffer.get_context(include_summary=False)
print(context_string)
The output is a clean, human-readable transcript that the LLM can easily parse:
system: You are a helpful AI assistant specialized in Python programming.
user: Can you explain async/await in Python?
assistant: async/await in Python allows you to write asynchronous code. The 'async def' keyword defines a coroutine, and 'await' pauses execution until the awaited operation completes.
user: What libraries would you recommend for async web scraping?
You would then append the user's latest query to this context string before sending it to the generation model, as sketched in the example below. This ensures the model has a complete view of the immediate conversational history. While effective for short-to-medium conversations, storing and sending the full history can become inefficient. As a conversation grows, the token count of the buffer's context can exceed the model's limit, leading to errors or costly truncation. The next section explores summary memory, a technique for managing this challenge in longer dialogues.
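Here is a rough sketch of that flow. The generate function is a hypothetical stand-in for whatever LLM client your application uses; it is not part of the memory module.
# Hypothetical stand-in for a real LLM call; replace with your client of choice
def generate(prompt: str) -> str:
    return "(model response)"

# New user input that has not yet been added to the buffer
latest_query = "Could you show an example using one of those libraries?"

# Combine the recent history with the new query
prompt = f"{buffer.get_context(include_summary=False)}\nuser: {latest_query}"

# Send the combined prompt to the model
response = generate(prompt)

# Record both sides of the exchange so the buffer stays current
buffer.add_message("user", latest_query)
buffer.add_message("assistant", response)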