Language models, by their design, operate on a per-request basis. Each time you send a prompt to an LLM, it processes that input in isolation, with no inherent memory of your previous interactions. Think of it like a brilliant but forgetful expert; you can ask any question, but you have to provide all the necessary background information every single time.
This stateless nature presents a significant obstacle when building applications that require continuous dialogue. A conversation is more than a series of unrelated questions and answers; it's a cumulative exchange where context builds over time. Without a mechanism to retain this context, the conversational flow breaks down.
For example, take this simple interaction:
User: My name is Alex. I'm interested in learning about machine learning.
Assistant: Hello! Machine learning is a fascinating topic. What specifically would you like to know?
User: What are the main types?
Assistant: The main types are supervised, unsupervised, and reinforcement learning. How can I help you today?
In the final turn, the assistant behaves as though the conversation is just beginning: it has lost the user's name and the context of the earlier exchange, closing with a generic greeting. This forces the user to repeat information and makes the interaction feel disjointed and unnatural.
This behavior is a direct consequence of how LLM APIs are typically designed. Each call is an independent, stateless transaction. The model receives an input, processes it, generates an output, and then discards the state associated with that request. This architecture ensures scalability and predictability but places the burden of managing conversational context entirely on the developer's application.
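To make this concrete, the sketch below sends two independent requests. The call_llm function is a hypothetical placeholder for whichever client library you use; the important detail is that the second request's payload contains nothing from the first.
# Hypothetical placeholder for a real LLM client call.
def call_llm(messages: list[dict]) -> str:
    ...

# Interaction 1: the model sees the user's name and topic.
response_1 = call_llm([
    {"role": "user", "content": "My name is Alex. I'm learning about machine learning."}
])

# Interaction 2: a separate, stateless request. Nothing from Interaction 1
# is included, so the model cannot know who Alex is or what "types" refers to.
response_2 = call_llm([
    {"role": "user", "content": "What are the main types?"}
])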
Each API call is an isolated transaction. The model in Interaction 2 has no memory of what happened in Interaction 1.
To build a coherent conversation, our application must serve as the model's memory. The standard approach is to collect the history of the exchange and include it with every new user message. By sending a transcript of the dialogue, we provide the LLM with the necessary context to generate a relevant and stateful response.
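As a minimal sketch of that idea, the transcript can live in a plain Python list of role/content pairs, with the entire list resent alongside each new message:
# Keep the transcript in a plain list and resend it on every turn.
history = []

# Turn 1
history.append({"role": "user", "content": "My name is Alex. I'm learning about machine learning."})
history.append({"role": "assistant", "content": "Hello Alex! Machine learning is a great topic."})

# Turn 2: the request payload is the full history plus the new message, so the
# model can tell that "the main types" refers to machine learning.
new_message = {"role": "user", "content": "What are the main types?"}
request_payload = history + [new_message]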
While you could manage this with a simple Python list, as in the sketch above, that approach quickly becomes cumbersome as conversations grow. It is better to use a dedicated structure for managing conversation history, and the ConversationBuffer class from kerb.memory is designed for exactly this purpose.
from kerb.memory import ConversationBuffer
# Initialize a buffer to store the conversation
buffer = ConversationBuffer()
# Turn 1
buffer.add_message("user", "My name is Alex. I'm learning about machine learning.")
buffer.add_message("assistant", "Hello Alex! Machine learning is a great topic. Where should we start?")
# Turn 2
buffer.add_message("user", "What are the main types?")
# For a stateful response, we would now send the entire buffer history to the LLM.
# Here, we'll just add the expected response to the buffer.
buffer.add_message("assistant", "The main types are supervised, unsupervised, and reinforcement learning.")
# You can inspect the stored messages
print(f"Messages stored: {len(buffer.messages)}")
for msg in buffer.messages:
    print(f"- {msg.role}: {msg.content}")
This method effectively solves the statelessness problem, but it introduces a new, significant constraint: the context window. Language models can only process a finite amount of text at once, a limit measured in tokens. As a conversation grows, the history we send with each request also grows. Eventually, the total number of tokens in the history plus the new user query will exceed the model's context window, resulting in an error. Furthermore, sending long histories with every request increases API costs and latency.
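As a rough illustration, continuing with the buffer from the example above, you can estimate how much of the context window the history already consumes. The four-characters-per-token figure is only a heuristic (a real application would use the tokenizer that matches its model), and the 8,000-token limit is just an assumed example:
# Rough token estimate, assuming ~4 characters per token on average.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

history_tokens = sum(estimate_tokens(msg.content) for msg in buffer.messages)
context_window = 8_000  # assumed example limit; actual limits vary by model

print(f"History uses roughly {history_tokens} of the {context_window} available tokens.")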
Effectively managing this trade-off between providing enough context and staying within token limits is a central challenge of building conversational applications. The following sections will explore different memory strategies to handle this, starting with the most direct approach: the conversation buffer.