To effectively manage an LLM's context window, you first need a precise way to measure the size of your input. While we often think in terms of words or characters, LLMs operate on a different unit: tokens. A token is a unit of text that may be a whole word, a sub-word fragment, or even a single character or punctuation mark. For example, the word "tokenization" might be split into two tokens: "token" and "ization".
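If you want to inspect the actual pieces a word is broken into, you can use the open-source tiktoken package (assumed to be installed here; it is separate from the tokenizer module introduced below) to look at the cl100k_base encoding directly. This is only a quick sketch for illustration.
import tiktoken
# Peek at how the cl100k_base encoding splits a word into sub-word pieces.
enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("tokenization")
pieces = [enc.decode([token_id]) for token_id in token_ids]
print(f"Token IDs: {token_ids}")
print(f"Token pieces: {pieces}")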
Understanding and accurately counting tokens is fundamental for building reliable LLM applications. It allows you to avoid exceeding a model’s context limit, which would cause an API error, and helps you manage costs, which are directly tied to the number of input and output tokens.
The tokenizer module provides the count_tokens function, a straightforward utility for calculating the token count of any given text.
The most direct way to get a token count is to pass your text to the count_tokens function. It's important to specify which tokenizer to use, as different models and model families use different tokenization schemes. The Tokenizer enum provides access to common tokenizers. For modern OpenAI models like GPT-4 and GPT-3.5-Turbo, CL100K_BASE is the correct choice.
from kerb.tokenizer import count_tokens, Tokenizer
text = "This is a sample sentence for token counting."
token_count = count_tokens(text, tokenizer=Tokenizer.CL100K_BASE)
print(f"Text: '{text}'")
print(f"Character count: {len(text)}")
print(f"Token count: {token_count}")
Notice that the token count matches neither the character count nor the word count. The ratio of tokens to characters or words varies with the text's structure, language, and complexity.
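To see this variation yourself, compare the characters-per-token ratio across different kinds of text. The sample strings below are only illustrative; any text will do.
from kerb.tokenizer import count_tokens, Tokenizer
# A few contrasting samples: prose, code, and accented text.
samples = {
    "plain English": "The quick brown fox jumps over the lazy dog.",
    "source code": "def add(a, b):\n    return a + b",
    "accented French": "Ceci était une phrase écrite en français.",
}
for label, sample in samples.items():
    tokens = count_tokens(sample, tokenizer=Tokenizer.CL100K_BASE)
    ratio = len(sample) / tokens
    print(f"{label}: {len(sample)} chars, {tokens} tokens, {ratio:.1f} chars/token")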
Using the correct tokenizer for your target model is essential for accuracy. A mismatch produces incorrect counts, which can lead to unexpected errors or higher costs. The Tokenizer enum includes several options for different model families. For instance, P50K_BASE is used by older OpenAI models and some code-specialized models.
Let's compare the token counts for the same text using different tokenizers.
from kerb.tokenizer import count_tokens, Tokenizer
text = "Large language models have revolutionized AI."
# For GPT-4, GPT-3.5-Turbo
tokens_cl100k = count_tokens(text, tokenizer=Tokenizer.CL100K_BASE)
print(f"CL100K_BASE count: {tokens_cl100k}")
# For older code models
tokens_p50k = count_tokens(text, tokenizer=Tokenizer.P50K_BASE)
print(f"P50K_BASE count: {tokens_p50k}")
Even for this short sentence, the token counts can differ. While the difference may seem small, it can accumulate to hundreds or thousands of tokens across large documents, affecting both performance and cost. The rule of thumb is simple: always match the tokenizer to the model you are calling.
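One way to enforce this rule in application code is to keep a small mapping from model name to tokenizer and route every count through it. The mapping and helper below are a simplified, hypothetical sketch, not an exhaustive or authoritative list.
from kerb.tokenizer import count_tokens, Tokenizer
# Illustrative mapping only; confirm the correct encoding for each model
# in your provider's documentation.
MODEL_TOKENIZERS = {
    "gpt-4": Tokenizer.CL100K_BASE,
    "gpt-3.5-turbo": Tokenizer.CL100K_BASE,
    "text-davinci-003": Tokenizer.P50K_BASE,
}
def count_tokens_for_model(text, model_name):
    """Count tokens with the tokenizer that matches the target model."""
    tokenizer = MODEL_TOKENIZERS.get(model_name, Tokenizer.CL100K_BASE)
    return count_tokens(text, tokenizer=tokenizer)
print(count_tokens_for_model("Large language models have revolutionized AI.", "gpt-4"))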
The official tokenizers give exact counts but require loading specific encoding files. When you need a very fast, lightweight estimate without external dependencies, you can use an approximation method instead. The library provides several heuristics for this purpose. A common one is CHAR_4, which assumes roughly four characters per token for typical English text.
from kerb.tokenizer import count_tokens, Tokenizer
text = "This is a long sentence used for demonstrating the trade-offs between accurate tokenization and fast approximation methods for quick checks."
# Accurate count
accurate_count = count_tokens(text, tokenizer=Tokenizer.CL100K_BASE)
print(f"Accurate count (CL100K_BASE): {accurate_count}")
# Fast approximation
approx_count = count_tokens(text, tokenizer=Tokenizer.CHAR_4)
print(f"Approximated count (CHAR_4): {approx_count}")
Approximations are useful for client-side validation or high-throughput systems where a rough estimate is sufficient, but for final validation before an API call, using the model-specific tokenizer is always recommended.
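If you do rely on an approximation, it helps to know how far off it can be. The short sketch below measures the relative error of the CHAR_4 estimate against the accurate count for a sample string; the error will vary with your own text.
from kerb.tokenizer import count_tokens, Tokenizer
text = "Approximate counts are cheap, but you should know how far off they can be."
# Compare the heuristic estimate against the model-specific count.
accurate = count_tokens(text, tokenizer=Tokenizer.CL100K_BASE)
approximate = count_tokens(text, tokenizer=Tokenizer.CHAR_4)
error_pct = abs(approximate - accurate) / accurate * 100
print(f"Accurate: {accurate}, approximate: {approximate}, error: {error_pct:.0f}%")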
Let's apply this to a practical scenario. Before sending a request to an LLM, you can check whether your combined prompt fits within the model's context window. This preventive check makes your application more robust and avoids failed API calls.
Imagine you are building a chatbot using a model with a 4096-token context window. You decide to reserve 1000 tokens for the model's response, leaving 3096 tokens for your input, which includes a system prompt and the user's message.
from kerb.tokenizer import count_tokens, Tokenizer
# Application settings
CONTEXT_LIMIT = 4096
RESERVED_FOR_COMPLETION = 1000
INPUT_TOKEN_BUDGET = CONTEXT_LIMIT - RESERVED_FOR_COMPLETION
# Prompts
system_prompt = "You are a helpful assistant that explains complex topics simply."
user_input = "Can you provide a detailed explanation of Retrieval-Augmented Generation (RAG) and how it helps reduce hallucinations in large language models?"
# Count tokens for each part of the prompt
system_tokens = count_tokens(system_prompt, tokenizer=Tokenizer.CL100K_BASE)
user_tokens = count_tokens(user_input, tokenizer=Tokenizer.CL100K_BASE)
total_input_tokens = system_tokens + user_tokens
print(f"System prompt tokens: {system_tokens}")
print(f"User input tokens: {user_tokens}")
print(f"Total input tokens: {total_input_tokens}")
print(f"Input token budget: {INPUT_TOKEN_BUDGET}")
if total_input_tokens > INPUT_TOKEN_BUDGET:
    print("\nWARNING: Input exceeds the available token budget!")
else:
    print("\nOK: Input fits within the available token budget.")
By performing this check, you can decide whether to proceed with the API call or to first apply a truncation strategy, which we will cover in the next section. This simple validation is a building block for creating reliable and cost-effective LLM applications.
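To reuse this check across your application, you can wrap it in a small helper. The function below is a hypothetical sketch; the name fits_in_budget and the default limits are illustrative and should be adjusted to your model.
from kerb.tokenizer import count_tokens, Tokenizer
def fits_in_budget(prompt_parts, context_limit=4096, reserved_for_completion=1000,
                   tokenizer=Tokenizer.CL100K_BASE):
    """Return (fits, total_tokens) for a list of prompt strings."""
    budget = context_limit - reserved_for_completion
    total = sum(count_tokens(part, tokenizer=tokenizer) for part in prompt_parts)
    return total <= budget, total
ok, total = fits_in_budget(["You are a helpful assistant.", "Explain RAG briefly."])
print(f"Fits: {ok} ({total} tokens)")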