As your applications grow in complexity, prompts are rarely just a single string of text. More often, you will construct prompts by combining multiple pieces of information: a system message to guide the model's behavior, the user's current query, a history of the conversation, and perhaps several documents retrieved from a knowledge base for a Retrieval-Augmented Generation (RAG) system. Managing the combined size of these components is a significant engineering challenge.
This is where the idea of a "token budget" becomes useful. A token budget is a predefined limit on the total number of tokens you can allocate to a prompt. By treating the context window as a budget, you can programmatically decide how to "spend" your available tokens across different parts of the prompt, ensuring you never exceed the model's limit while making the best use of the available space.
Imagine you have a 4096-token context window. Before you even start adding content, it's a best practice to reserve a portion of this budget for the model's response. If you use the entire window for the input, the model will have no room to generate an answer. A common approach is to reserve 25-50% of the total budget for the output.
The remaining budget is then distributed among your input components.
Different parts of a prompt consume tokens from a fixed budget. A portion of the total context window is reserved for the model's generated output.
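For example, reserving a quarter of a 4096-token window for the output leaves 3072 tokens for the input. A quick sketch of that arithmetic follows; the 25% fraction is just one illustrative choice within the range mentioned above.
# Split the context window between input and output.
# The 25% reservation is an illustrative choice from the 25-50% range above.
context_window = 4096
output_fraction = 0.25
reserved_for_output = int(context_window * output_fraction)  # 1024 tokens
input_budget = context_window - reserved_for_output          # 3072 tokens
print(f"Input budget: {input_budget} tokens, reserved for output: {reserved_for_output}")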
To manage token allocation systematically, we can use a helper class that tracks usage. The TokenBudgetManager helps you monitor how many tokens have been "spent" and how many remain.
Let's start by initializing a manager for a model with a 4096-token context window.
from kerb.tokenizer import count_tokens, Tokenizer, truncate_to_token_limit

# This is a class provided for the example.
# In a real library, it would be imported directly.
class TokenBudgetManager:
    def __init__(self, budget_limit: int):
        self.budget_limit = budget_limit
        self.tokens_used = 0

    def check_budget(self, tokens_needed: int) -> bool:
        return self.tokens_used + tokens_needed <= self.budget_limit

    def use_tokens(self, tokens: int) -> bool:
        if not self.check_budget(tokens):
            return False
        self.tokens_used += tokens
        return True

    def get_remaining_budget(self) -> int:
        return self.budget_limit - self.tokens_used
# --- Example Usage ---
# Total context window for a model like gpt-3.5-turbo
TOTAL_CONTEXT_WINDOW = 4096
# Reserve 1024 tokens for the model's output
RESERVED_FOR_OUTPUT = 1024
INPUT_BUDGET = TOTAL_CONTEXT_WINDOW - RESERVED_FOR_OUTPUT
# Initialize the budget manager for our input prompt
budget = TokenBudgetManager(budget_limit=INPUT_BUDGET)
print(f"Total input budget: {budget.budget_limit} tokens")
Now, let's build a complex prompt for a RAG application. Our prompt will contain a system message, a user query, and a list of retrieved documents.
First, we add the fixed components and update our budget.
system_prompt = "You are an AI assistant. Use the provided documents to answer the user's question."
user_query = "What are the main strategies for effective token management in LLM applications?"
# Count and use tokens for the system prompt
system_tokens = count_tokens(system_prompt, tokenizer=Tokenizer.CL100K_BASE)
budget.use_tokens(system_tokens)
print(f"Used {system_tokens} tokens for system prompt. Remaining: {budget.get_remaining_budget()}")
# Count and use tokens for the user query
query_tokens = count_tokens(user_query, tokenizer=Tokenizer.CL100K_BASE)
budget.use_tokens(query_tokens)
print(f"Used {query_tokens} tokens for user query. Remaining: {budget.get_remaining_budget()}")
# Initialize the final context list
final_prompt_context = [system_prompt, user_query]
Next, we have a list of retrieved documents, sorted by relevance. We want to add as many of these as possible to our prompt without exceeding the budget. We can loop through the documents, check if the next one fits, and add it if it does.
retrieved_documents = [
    "Document 1: Token counting is the first step. Use a tokenizer that matches your model, like cl100k_base for GPT models, to get an accurate count of tokens for any given text. This helps in estimating API costs and managing context windows.",
    "Document 2: Truncation is a common strategy. If a document is too long, you can truncate it by preserving either the beginning or the end. For summaries, keep the beginning; for recent data, keep the end.",
    "Document 3: A token budget helps manage complex prompts. You allocate parts of the context window to different components like the system prompt, user query, and retrieved documents, while always reserving space for the model's output.",
    "Document 4: Context compression techniques like summarization can reduce the token count of lengthy documents while preserving the most important information, making it easier to fit more context into the prompt.",
    "Document 5: For conversational history, sliding window strategies are effective. You can keep a fixed number of recent messages or a total number of tokens to ensure the conversation history doesn't grow indefinitely and exceed the context limit."
]
print(f"\nAdding retrieved documents to the prompt...")
for i, doc in enumerate(retrieved_documents):
doc_tokens = count_tokens(doc, tokenizer=Tokenizer.CL100K_BASE)
# Check if this document fits in the remaining budget
if budget.check_budget(doc_tokens):
budget.use_tokens(doc_tokens)
final_prompt_context.append(doc)
print(f"Added Document {i+1} ({doc_tokens} tokens). Remaining budget: {budget.get_remaining_budget()}")
else:
print(f"Document {i+1} ({doc_tokens} tokens) does not fit. Stopping.")
break
This simple loop effectively packs the prompt with the most relevant information that fits within our defined constraints.
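Once the loop finishes, you can join the selected pieces into a single prompt string and double-check the total against the budget. The double-newline separator below is just one reasonable formatting choice.
# Assemble the packed prompt and sanity-check the total token count.
# Note: the separators themselves add a few tokens that the budget
# manager does not track, so leave a small margin in practice.
assembled_prompt = "\n\n".join(final_prompt_context)
total_tokens = count_tokens(assembled_prompt, tokenizer=Tokenizer.CL100K_BASE)
print(f"\nAssembled prompt: ~{total_tokens} tokens (input budget: {budget.budget_limit})")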
What if a highly relevant document is too large to fit in the remaining budget? In the previous example, we simply stopped and left it out. A better approach is to truncate the document so that at least part of it can be included.
The truncate_to_token_limit function is perfect for this. Let's modify our logic to include this step.
# Reset budget for a new example
budget = TokenBudgetManager(budget_limit=150)  # A smaller budget for demonstration
budget.use_tokens(count_tokens(system_prompt, tokenizer=Tokenizer.CL100K_BASE))
budget.use_tokens(count_tokens(user_query, tokenizer=Tokenizer.CL100K_BASE))

final_prompt_context_truncated = [system_prompt, user_query]

print("\n--- Truncation Example ---")
print(f"Starting budget: {budget.budget_limit}, Remaining after prompts: {budget.get_remaining_budget()}")
for i, doc in enumerate(retrieved_documents):
    doc_tokens = count_tokens(doc, tokenizer=Tokenizer.CL100K_BASE)
    remaining_budget = budget.get_remaining_budget()

    if remaining_budget <= 0:
        print("No budget remaining. Stopping.")
        break

    if doc_tokens > remaining_budget:
        # Document is too large, so we truncate it to fit the remaining budget
        print(f"Document {i+1} ({doc_tokens} tokens) is too large. Truncating to fit {remaining_budget} tokens.")
        truncated_doc = truncate_to_token_limit(
            doc,
            max_tokens=remaining_budget,
            tokenizer=Tokenizer.CL100K_BASE
        )
        final_prompt_context_truncated.append(truncated_doc)

        # Spend the rest of the budget on the truncated document
        actual_tokens = count_tokens(truncated_doc, tokenizer=Tokenizer.CL100K_BASE)
        budget.use_tokens(actual_tokens)
        print(f"Added truncated document ({actual_tokens} tokens). Remaining budget: {budget.get_remaining_budget()}")

        # No more space left, so we stop
        break
    else:
        # Document fits within the remaining budget
        budget.use_tokens(doc_tokens)
        final_prompt_context_truncated.append(doc)
        print(f"Added Document {i+1} ({doc_tokens} tokens). Remaining budget: {budget.get_remaining_budget()}")
# Finally, combine all parts into a single string to send to the LLM
final_prompt = "\n\n".join(final_prompt_context_truncated)
By combining token counting, budget management, and dynamic truncation, you can build systems that automatically and intelligently manage the LLM's context window. This prevents runtime errors from oversized prompts, helps control API costs, and ensures that the most valuable information is always included in the context provided to the model.
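To tie these pieces together, here is a small helper that applies the same pattern end to end. It is a sketch built from the functions used above, not part of the kerb library; the function name and signature are illustrative.
def build_rag_prompt(system_prompt: str, user_query: str,
                     documents: list, input_budget: int) -> str:
    """Illustrative helper: pack documents into a prompt within a token budget."""
    budget = TokenBudgetManager(budget_limit=input_budget)
    parts = [system_prompt, user_query]

    # Spend tokens on the fixed components first
    for part in parts:
        budget.use_tokens(count_tokens(part, tokenizer=Tokenizer.CL100K_BASE))

    # Greedily add documents in relevance order, truncating the last one to fit
    for doc in documents:
        remaining = budget.get_remaining_budget()
        if remaining <= 0:
            break
        doc_tokens = count_tokens(doc, tokenizer=Tokenizer.CL100K_BASE)
        if doc_tokens > remaining:
            doc = truncate_to_token_limit(doc, max_tokens=remaining,
                                          tokenizer=Tokenizer.CL100K_BASE)
            doc_tokens = count_tokens(doc, tokenizer=Tokenizer.CL100K_BASE)
        budget.use_tokens(doc_tokens)
        parts.append(doc)

    return "\n\n".join(parts)

# Example usage with the values defined earlier
prompt = build_rag_prompt(system_prompt, user_query,
                          retrieved_documents, input_budget=INPUT_BUDGET)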