Repeated calls to LLM APIs introduce latency and accumulate costs. A direct and effective strategy to mitigate this is caching. LLM response caching involves storing the result of a generation call and reusing it whenever the exact same request is made again. This avoids redundant API calls, leading to faster response times and significant cost savings, especially in applications with repetitive queries.
At the heart of any caching system is the cache key. This key is a unique identifier generated from the inputs of a function. For an LLM call, the output depends on more than just the prompt text. It is also influenced by the model name, temperature, and other generation parameters. An effective cache key must incorporate all these factors to prevent incorrect results. A request for a summary using gpt-4o-mini is different from one using claude-3-5-haiku, and both should have distinct cache entries.
The toolkit provides a utility function, generate_prompt_key, designed for this purpose. It creates a deterministic hash based on the prompt and any specified model parameters.
from kerb.cache import generate_prompt_key
same_prompt = "Explain Python programming"
# Same prompt, but different parameters result in different keys
key_gpt4 = generate_prompt_key(same_prompt, model="gpt-4", temperature=0.7)
key_gpt35 = generate_prompt_key(same_prompt, model="gpt-3.5-turbo", temperature=0.7)
key_temp1 = generate_prompt_key(same_prompt, model="gpt-4", temperature=1.0)
print(f"model=gpt-4, temp=0.7: {key_gpt4[:16]}...")
print(f"model=gpt-3.5-turbo, temp=0.7: {key_gpt35[:16]}...")
print(f"model=gpt-4, temp=1.0: {key_temp1[:16]}...")
As you can see, each unique combination of prompt and parameters generates a unique key, ensuring that we only reuse a cached response when the request is identical.
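The hashing is also deterministic: calling generate_prompt_key again with the same prompt and parameters yields the same key, which is what lets a later identical request find the earlier response. A quick check (reusing same_prompt from above) illustrates this:

# Identical inputs always map to the same key
key_a = generate_prompt_key(same_prompt, model="gpt-4", temperature=0.7)
key_b = generate_prompt_key(same_prompt, model="gpt-4", temperature=0.7)
print(key_a == key_b)  # True: a repeat request would hit the cached entry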
Implementing response caching follows a straightforward pattern: "check, then compute." Before making an API call, you look up the cache under the key produced by generate_prompt_key from the prompt and model parameters. The diagram below illustrates this flow, highlighting how caching bypasses expensive API calls.
The caching logic intercepts requests. A cache hit returns a stored response directly, while a miss proceeds to the API and stores the new result.
Let's see this in practice. First, we create an in-memory cache, which is fast and simple for single-session applications.
from kerb.cache import create_memory_cache
# Create a simple in-memory cache
cache = create_memory_cache(max_size=100)
Now, we can implement the workflow for a given prompt.
# Assume mock_llm_api_call exists and returns a dictionary with the response and cost
# from some_module import mock_llm_api_call
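# For illustration only, a hypothetical minimal stand-in could look like this;
# a real helper would call your provider's SDK and return the text plus its cost.
def mock_llm_api_call(prompt, model="gpt-4", temperature=0.7, **kwargs):
    return {
        "response": f"[{model}] Simulated answer to: {prompt}",
        "cost": 0.002,  # illustrative per-call cost in dollars
    }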
prompt = "What is the weather like today?"
model_params = {"model": "gpt-4", "temperature": 0.7}
# 1. Generate the cache key
key = generate_prompt_key(prompt, **model_params)

# 2. Check the cache
cached_response = cache.get(key)

if cached_response:
    # 3. Cache Hit
    print("✓ Cache hit - no API call needed!")
    response = cached_response
else:
    # 4. Cache Miss
    print("✗ Cache miss - calling API")
    response = mock_llm_api_call(prompt, **model_params)
    # 5. Store the new response
    cache.set(key, response)

print(f"Response: {response['response']}")
This pattern is effective but can add boilerplate code to every function that calls an LLM. A better approach is to encapsulate this logic in a dedicated client class.
For cleaner application code, you can build a wrapper class around your LLM client that handles caching automatically. This class will manage the cache instance and implement the check-then-compute logic inside its generation method.
Here is an example of a CachedLLMClient that provides a generate method with built-in caching.
class CachedLLMClient:
    """LLM client with automatic caching."""

    def __init__(self):
        self.cache = create_memory_cache()
        self.api_calls = 0
        self.cache_hits = 0

    def generate(self, prompt, model="gpt-4", temperature=0.7, **kwargs):
        """Generate response with automatic caching."""
        # Generate cache identifier from all relevant parameters
        key = generate_prompt_key(
            prompt=prompt,
            model=model,
            temperature=temperature,
            **kwargs
        )

        # Check cache
        cached = self.cache.get(key)
        if cached:
            self.cache_hits += 1
            return cached["response"]

        # If miss, call the actual API
        self.api_calls += 1
        response = mock_llm_api_call(prompt, model, temperature, **kwargs)

        # Store the new response in the cache
        self.cache.set(key, response)
        return response["response"]

    def stats(self):
        """Get usage statistics."""
        total = self.api_calls + self.cache_hits
        hit_rate = (self.cache_hits / total * 100) if total > 0 else 0
        return {
            "total_requests": total,
            "api_calls": self.api_calls,
            "cache_hits": self.cache_hits,
            "hit_rate": f"{hit_rate:.1f}%",
        }
# Using the client
client = CachedLLMClient()
print("Using CachedLLMClient:")
client.generate("What is machine learning?")
client.generate("What is machine learning?") # This will be a cache hit
client.generate("Explain neural networks")
client.generate("What is machine learning?") # This will be another cache hit
# Show statistics
stats = client.stats()
print(f"\nClient Statistics:")
print(f" Total requests: {stats['total_requests']}")
print(f" API calls: {stats['api_calls']}")
print(f" Cache hits: {stats['cache_hits']}")
print(f" Hit rate: {stats['hit_rate']}")
This client made four requests but only two actual API calls, achieving a 50% hit rate. In applications with highly repetitive queries, this hit rate can be much higher, leading to dramatic improvements in performance and cost.
A major benefit of caching is cost reduction. You can quantify these savings by storing the cost of each API call along with its response. When a cache hit occurs, you can log the saved amount.
The set method on a cache instance accepts optional metadata for storing extra information that isn't part of the cached value itself. This is a perfect place to store the cost of the original API call.
# In our CachedLLMClient or manual workflow...
total_saved = 0.0  # running total of estimated savings, kept across requests

# When a cache miss occurs:
response = mock_llm_api_call(prompt, model)
# Store the response and its associated cost in the metadata
cache.set(key, response, metadata={"cost": response["cost"]})

# When a cache hit occurs:
cached_entry = cache.get_entry(key)  # get_entry retrieves value and metadata
if cached_entry:
    response = cached_entry.value
    saved_cost = cached_entry.metadata.get("cost", 0.0)
    total_saved += saved_cost
By tracking total_saved, you can directly measure the financial impact of your caching implementation. In a production system, this data is invaluable for monitoring operational expenses and demonstrating the ROI of performance optimizations.
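To make this concrete, here is a minimal sketch that subclasses the earlier CachedLLMClient, records each real call's cost in the entry metadata, and accumulates the savings on every hit. It assumes the same hypothetical mock_llm_api_call helper and uses the get_entry/metadata interface shown above.

class CostTrackingLLMClient(CachedLLMClient):
    """Cached client that also tracks estimated cost savings."""

    def __init__(self):
        super().__init__()
        self.total_saved = 0.0

    def generate(self, prompt, model="gpt-4", temperature=0.7, **kwargs):
        key = generate_prompt_key(
            prompt=prompt, model=model, temperature=temperature, **kwargs
        )

        entry = self.cache.get_entry(key)
        if entry:
            # Cache hit: credit the cost of the original call as savings
            self.cache_hits += 1
            self.total_saved += entry.metadata.get("cost", 0.0)
            return entry.value["response"]

        # Cache miss: pay for the API call and remember what it cost
        self.api_calls += 1
        response = mock_llm_api_call(prompt, model, temperature, **kwargs)
        self.cache.set(key, response, metadata={"cost": response["cost"]})
        return response["response"]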
The kerb.cache module offers several storage backends for different use cases.
MemoryCache: An in-memory cache, created with create_memory_cache(). It's extremely fast but volatile, meaning the cache is cleared when the application restarts. It is ideal for caching within a single long-running process, such as a web server.
DiskCache: A file-system-based cache, created with create_disk_cache(). It persists data across application restarts by writing to disk. While slower than MemoryCache, it's useful for command-line tools or batch jobs that need to reuse results from previous runs.
from kerb.cache import create_disk_cache

# Create a cache that stores data in a local .cache/llm directory
disk_cache = create_disk_cache(cache_dir=".cache/llm", serializer="json")
TieredCache: This backend, created with create_tiered_cache(), combines MemoryCache and DiskCache. It provides the speed of an in-memory cache for frequently accessed items while using a disk cache for persistence and as a larger, slower backup. This is often the best choice for production applications that require both high performance and data persistence.
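As a brief, hypothetical sketch: create_tiered_cache is called here without arguments, but the actual function may accept options such as a memory-tier size or a disk directory, so check its signature in kerb.cache. Once created, a tiered cache is used through the same get/set interface as the other backends.

from kerb.cache import create_tiered_cache, generate_prompt_key

# Assumed zero-argument construction; real options may configure both tiers
tiered_cache = create_tiered_cache()

key = generate_prompt_key("Summarize this article", model="gpt-4", temperature=0.7)
if tiered_cache.get(key) is None:
    # Miss in both tiers: call the API (or a mock) and store the result
    tiered_cache.set(key, {"response": "A short summary...", "cost": 0.002})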