To effectively manage an LLM's context window, you first need a precise way to measure the size of your input. While we often think in terms of words or characters, LLMs operate on a different unit: tokens. A token is a unit of text that may be a whole word, a sub-word fragment, or even a single character or punctuation mark. For example, the word "tokenization" might be split into two tokens: "token" and "ization".
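If you want to inspect the actual pieces a word is broken into, you can use the open-source tiktoken package (assumed to be installed here; it is separate from the tokenizer module introduced below) to look at the cl100k_base encoding directly. This is only a quick sketch for illustration.
import tiktoken
# Peek at how the cl100k_base encoding splits a word into sub-word pieces.
enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("tokenization")
pieces = [enc.decode([token_id]) for token_id in token_ids]
print(f"Token IDs: {token_ids}")
print(f"Token pieces: {pieces}")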
Understanding and accurately counting tokens is fundamental for building reliable LLM applications. It allows you to avoid exceeding a model’s context limit, which would cause an API error, and helps you manage costs, which are directly tied to the number of input and output tokens.
The tokenizer module provides the count_tokens function, a straightforward utility for calculating the token count of any given text.
The most direct way to get a token count is to pass your text to the count_tokens function. It's important to specify which tokenizer to use, as different models and model families use different tokenization schemes. The Tokenizer enum provides access to common tokenizers. For modern OpenAI models like GPT-4 and GPT-3.5-Turbo, CL100K_BASE is the correct choice.
from kerb.tokenizer import count_tokens, Tokenizer
text = "This is a sample sentence for token counting."
token_count = count_tokens(text, tokenizer=Tokenizer.CL100K_BASE)
print(f"Text: '{text}'")
print(f"Character count: {len(text)}")
print(f"Token count: {token_count}")
Notice that the token count matches neither the character count nor the word count. The ratio of tokens to characters or words varies with the text's structure, language, and complexity.
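To see this variation yourself, compare the characters-per-token ratio across different kinds of text. The sample strings below are only illustrative; any text will do.
from kerb.tokenizer import count_tokens, Tokenizer
# A few contrasting samples: prose, code, and accented text.
samples = {
    "plain English": "The quick brown fox jumps over the lazy dog.",
    "source code": "def add(a, b):\n    return a + b",
    "accented French": "Ceci était une phrase écrite en français.",
}
for label, sample in samples.items():
    tokens = count_tokens(sample, tokenizer=Tokenizer.CL100K_BASE)
    ratio = len(sample) / tokens
    print(f"{label}: {len(sample)} chars, {tokens} tokens, {ratio:.1f} chars/token")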
Using the correct tokenizer for your target model is essential for accuracy. A mismatch produces incorrect counts, which can lead to unexpected errors or higher costs. The Tokenizer enum includes several options for different model families. For instance, P50K_BASE is used by older OpenAI models and some code-specialized models.
Let's compare the token counts for the same text using different tokenizers.
from kerb.tokenizer import count_tokens, Tokenizer
text = "Large language models have revolutionized AI."
# For GPT-4, GPT-3.5-Turbo
tokens_cl100k = count_tokens(text, tokenizer=Tokenizer.CL100K_BASE)
print(f"CL100K_BASE count: {tokens_cl100k}")
# For older code models
tokens_p50k = count_tokens(text, tokenizer=Tokenizer.P50K_BASE)
print(f"P50K_BASE count: {tokens_p50k}")
Even for this short sentence, the token counts can differ. While the difference may seem small, it can accumulate to hundreds or thousands of tokens across large documents, affecting both performance and cost. The rule of thumb is simple: always match the tokenizer to the model you are calling.
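One way to enforce this rule in application code is to keep a small mapping from model name to tokenizer and route every count through it. The mapping and helper below are a simplified, hypothetical sketch, not an exhaustive or authoritative list.
from kerb.tokenizer import count_tokens, Tokenizer
# Illustrative mapping only; confirm the correct encoding for each model
# in your provider's documentation.
MODEL_TOKENIZERS = {
    "gpt-4": Tokenizer.CL100K_BASE,
    "gpt-3.5-turbo": Tokenizer.CL100K_BASE,
    "text-davinci-003": Tokenizer.P50K_BASE,
}
def count_tokens_for_model(text, model_name):
    """Count tokens with the tokenizer that matches the target model."""
    tokenizer = MODEL_TOKENIZERS.get(model_name, Tokenizer.CL100K_BASE)
    return count_tokens(text, tokenizer=tokenizer)
print(count_tokens_for_model("Large language models have revolutionized AI.", "gpt-4"))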
The official tokenizers give exact counts but require loading specific encoding files. When you need a very fast, lightweight estimate without external dependencies, you can use an approximation method instead. The library provides several heuristics for this purpose. A common one is CHAR_4, which assumes roughly four characters per token for typical English text.
from kerb.tokenizer import count_tokens, Tokenizer
text = "This is a long sentence used for demonstrating the trade-offs between accurate tokenization and fast approximation methods for quick checks."
# Accurate count
accurate_count = count_tokens(text, tokenizer=Tokenizer.CL100K_BASE)
print(f"Accurate count (CL100K_BASE): {accurate_count}")
# Fast approximation
approx_count = count_tokens(text, tokenizer=Tokenizer.CHAR_4)
print(f"Approximated count (CHAR_4): {approx_count}")
Approximations are useful for client-side validation or high-throughput systems where a rough estimate is sufficient, but for final validation before an API call, using the model-specific tokenizer is always recommended.
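If you do rely on an approximation, it helps to know how far off it can be. The short sketch below measures the relative error of the CHAR_4 estimate against the accurate count for a sample string; the error will vary with your own text.
from kerb.tokenizer import count_tokens, Tokenizer
text = "Approximate counts are cheap, but you should know how far off they can be."
# Compare the heuristic estimate against the model-specific count.
accurate = count_tokens(text, tokenizer=Tokenizer.CL100K_BASE)
approximate = count_tokens(text, tokenizer=Tokenizer.CHAR_4)
error_pct = abs(approximate - accurate) / accurate * 100
print(f"Accurate: {accurate}, approximate: {approximate}, error: {error_pct:.0f}%")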
Let's apply this to a practical scenario. Before sending a request to an LLM, you can check whether your combined prompt fits within the model's context window. This preventive check makes your application more robust and avoids failed API calls.
Imagine you are building a chatbot using a model with a 4096-token context window. You decide to reserve 1000 tokens for the model's response, leaving 3096 tokens for your input, which includes a system prompt and the user's message.
from kerb.tokenizer import count_tokens, Tokenizer
# Application settings
CONTEXT_LIMIT = 4096
RESERVED_FOR_COMPLETION = 1000
INPUT_TOKEN_BUDGET = CONTEXT_LIMIT - RESERVED_FOR_COMPLETION
# Prompts
system_prompt = "You are a helpful assistant that explains complex topics simply."
user_input = "Can you provide a detailed explanation of Retrieval-Augmented Generation (RAG) and how it helps reduce hallucinations in large language models?"
# Count tokens for each part of the prompt
system_tokens = count_tokens(system_prompt, tokenizer=Tokenizer.CL100K_BASE)
user_tokens = count_tokens(user_input, tokenizer=Tokenizer.CL100K_BASE)
total_input_tokens = system_tokens + user_tokens
print(f"System prompt tokens: {system_tokens}")
print(f"User input tokens: {user_tokens}")
print(f"Total input tokens: {total_input_tokens}")
print(f"Input token budget: {INPUT_TOKEN_BUDGET}")
if total_input_tokens > INPUT_TOKEN_BUDGET:
    print("\nWARNING: Input exceeds the available token budget!")
else:
    print("\nOK: Input fits within the available token budget.")
By performing this check, you can decide whether to proceed with the API call or to first apply a truncation strategy, which we will cover in the next section. This simple validation is a building block for creating reliable and cost-effective LLM applications.
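To reuse this check across your application, you can wrap it in a small helper. The function below is a hypothetical sketch; the name fits_in_budget and the default limits are illustrative and should be adjusted to your model.
from kerb.tokenizer import count_tokens, Tokenizer
def fits_in_budget(prompt_parts, context_limit=4096, reserved_for_completion=1000,
                   tokenizer=Tokenizer.CL100K_BASE):
    """Return (fits, total_tokens) for a list of prompt strings."""
    budget = context_limit - reserved_for_completion
    total = sum(count_tokens(part, tokenizer=tokenizer) for part in prompt_parts)
    return total <= budget, total
ok, total = fits_in_budget(["You are a helpful assistant.", "Explain RAG briefly."])
print(f"Fits: {ok} ({total} tokens)")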