Language models, by their design, operate on a per-request basis. Each time you send a prompt to an LLM, it processes that input in isolation, with no inherent memory of your previous interactions. Think of it like a brilliant but forgetful expert; you can ask any question, but you have to provide all the necessary background information every single time.
This stateless nature presents a significant obstacle when building applications that require continuous dialogue. A conversation is more than a series of unrelated questions and answers; it's a cumulative exchange where context builds over time. Without a mechanism to retain this context, the conversational flow breaks down.
For example, take this simple interaction:
User: My name is Alex. I'm interested in learning about machine learning.
Assistant: Hello! Machine learning is a fascinating topic. What specifically would you like to know?
User: What are the main types?
Assistant: The main types are supervised, unsupervised, and reinforcement learning. How can I help you today?
In the final turn, the assistant behaves as though the conversation is just beginning: it has lost the user's name and the context of the earlier exchange, closing with a generic greeting. This forces the user to repeat information and makes the interaction feel disjointed and unnatural.
This behavior is a direct consequence of how LLM APIs are typically designed. Each call is an independent, stateless transaction. The model receives an input, processes it, generates an output, and then discards the state associated with that request. This architecture ensures scalability and predictability but places the burden of managing conversational context entirely on the developer's application.
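To make this concrete, the sketch below sends two independent requests. The call_llm function is a hypothetical placeholder for whichever client library you use; the important detail is that the second request's payload contains nothing from the first.
# Hypothetical placeholder for a real LLM client call.
def call_llm(messages: list[dict]) -> str:
    ...

# Interaction 1: the model sees the user's name and topic.
response_1 = call_llm([
    {"role": "user", "content": "My name is Alex. I'm learning about machine learning."}
])

# Interaction 2: a separate, stateless request. Nothing from Interaction 1
# is included, so the model cannot know who Alex is or what "types" refers to.
response_2 = call_llm([
    {"role": "user", "content": "What are the main types?"}
])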
Each API call is an isolated transaction. The model in Interaction 2 has no memory of what happened in Interaction 1.
To build a coherent conversation, our application must serve as the model's memory. The standard approach is to collect the history of the exchange and include it with every new user message. By sending a transcript of the dialogue, we provide the LLM with the necessary context to generate a relevant and stateful response.
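As a minimal sketch of that idea, the transcript can live in a plain Python list of role/content pairs, with the entire list resent alongside each new message:
# Keep the transcript in a plain list and resend it on every turn.
history = []

# Turn 1
history.append({"role": "user", "content": "My name is Alex. I'm learning about machine learning."})
history.append({"role": "assistant", "content": "Hello Alex! Machine learning is a great topic."})

# Turn 2: the request payload is the full history plus the new message, so the
# model can tell that "the main types" refers to machine learning.
new_message = {"role": "user", "content": "What are the main types?"}
request_payload = history + [new_message]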
While you could manage this with a simple Python list, as in the sketch above, that approach quickly becomes cumbersome as conversations grow. It is better to use a dedicated structure for managing conversation history, and the ConversationBuffer class from kerb.memory is designed for exactly this purpose.
from kerb.memory import ConversationBuffer
# Initialize a buffer to store the conversation
buffer = ConversationBuffer()
# Turn 1
buffer.add_message("user", "My name is Alex. I'm learning about machine learning.")
buffer.add_message("assistant", "Hello Alex! Machine learning is a great topic. Where should we start?")
# Turn 2
buffer.add_message("user", "What are the main types?")
# For a stateful response, we would now send the entire buffer history to the LLM.
# Here, we'll just add the expected response to the buffer.
buffer.add_message("assistant", "The main types are supervised, unsupervised, and reinforcement learning.")
# You can inspect the stored messages
print(f"Messages stored: {len(buffer.messages)}")
for msg in buffer.messages:
    print(f"- {msg.role}: {msg.content}")
This method effectively solves the statelessness problem, but it introduces a new, significant constraint: the context window. Language models can only process a finite amount of text at once, a limit measured in tokens. As a conversation grows, the history we send with each request also grows. Eventually, the total number of tokens in the history plus the new user query will exceed the model's context window, resulting in an error. Furthermore, sending long histories with every request increases API costs and latency.
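As a rough illustration, continuing with the buffer from the example above, you can estimate how much of the context window the history already consumes. The four-characters-per-token figure is only a heuristic (a real application would use the tokenizer that matches its model), and the 8,000-token limit is just an assumed example:
# Rough token estimate, assuming ~4 characters per token on average.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

history_tokens = sum(estimate_tokens(msg.content) for msg in buffer.messages)
context_window = 8_000  # assumed example limit; actual limits vary by model

print(f"History uses roughly {history_tokens} of the {context_window} available tokens.")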
Effectively managing this trade-off between providing enough context and staying within token limits is a central challenge of building conversational applications. The following sections will explore different memory strategies to handle this, starting with the most direct approach: the conversation buffer.