Repeated calls to LLM APIs introduce latency and accumulate costs. A direct and effective strategy to mitigate this is caching. LLM response caching involves storing the result of a generation call and reusing it whenever the exact same request is made again. This avoids redundant API calls, leading to faster response times and significant cost savings, especially in applications with repetitive queries.
At the heart of any caching system is the cache key. This key is a unique identifier generated from the inputs of a function. For an LLM call, the output depends on more than just the prompt text. It is also influenced by the model name, temperature, and other generation parameters. An effective cache key must incorporate all these factors to prevent incorrect results. A request for a summary using gpt-4o-mini is different from one using claude-3-5-haiku, and both should have distinct cache entries.
The toolkit provides a utility function, generate_prompt_key, designed for this purpose. It creates a deterministic hash based on the prompt and any specified model parameters.
from kerb.cache import generate_prompt_key
same_prompt = "Explain Python programming"
# Same prompt, but different parameters result in different keys
key_gpt4 = generate_prompt_key(same_prompt, model="gpt-4", temperature=0.7)
key_gpt35 = generate_prompt_key(same_prompt, model="gpt-3.5-turbo", temperature=0.7)
key_temp1 = generate_prompt_key(same_prompt, model="gpt-4", temperature=1.0)
print(f"model=gpt-4, temp=0.7: {key_gpt4[:16]}...")
print(f"model=gpt-3.5-turbo, temp=0.7: {key_gpt35[:16]}...")
print(f"model=gpt-4, temp=1.0: {key_temp1[:16]}...")
As you can see, each unique combination of prompt and parameters generates a unique key, ensuring that we only reuse a cached response when the request is identical.
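The hashing is also deterministic: calling generate_prompt_key again with the same prompt and parameters yields the same key, which is what lets a later identical request find the earlier response. A quick check (reusing same_prompt from above) illustrates this:

# Identical inputs always map to the same key
key_a = generate_prompt_key(same_prompt, model="gpt-4", temperature=0.7)
key_b = generate_prompt_key(same_prompt, model="gpt-4", temperature=0.7)
print(key_a == key_b)  # True: a repeat request would hit the cached entry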
Implementing response caching follows a straightforward pattern: "check, then compute." Before making an API call, you look up the cache under the key produced by generate_prompt_key from the prompt and model parameters. The diagram below illustrates this flow, highlighting how caching bypasses expensive API calls.
The caching logic intercepts requests. A cache hit returns a stored response directly, while a miss proceeds to the API and stores the new result.
Let's see this in practice. First, we create an in-memory cache, which is fast and simple for single-session applications.
from kerb.cache import create_memory_cache
# Create a simple in-memory cache
cache = create_memory_cache(max_size=100)
Now, we can implement the workflow for a given prompt.
# Assume mock_llm_api_call exists and returns a dictionary with the response and cost
# from some_module import mock_llm_api_call
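# For illustration only, a hypothetical minimal stand-in could look like this;
# a real helper would call your provider's SDK and return the text plus its cost.
def mock_llm_api_call(prompt, model="gpt-4", temperature=0.7, **kwargs):
    return {
        "response": f"[{model}] Simulated answer to: {prompt}",
        "cost": 0.002,  # illustrative per-call cost in dollars
    }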
prompt = "What is the weather like today?"
model_params = {"model": "gpt-4", "temperature": 0.7}
# 1. Generate the cache key
key = generate_prompt_key(prompt, **model_params)

# 2. Check the cache
cached_response = cache.get(key)

if cached_response:
    # 3. Cache Hit
    print("✓ Cache hit - no API call needed!")
    response = cached_response
else:
    # 4. Cache Miss
    print("✗ Cache miss - calling API")
    response = mock_llm_api_call(prompt, **model_params)
    # 5. Store the new response
    cache.set(key, response)

print(f"Response: {response['response']}")
This pattern is effective but can add boilerplate code to every function that calls an LLM. A better approach is to encapsulate this logic in a dedicated client class.
For cleaner application code, you can build a wrapper class around your LLM client that handles caching automatically. This class will manage the cache instance and implement the check-then-compute logic inside its generation method.
Here is an example of a CachedLLMClient that provides a generate method with built-in caching.
class CachedLLMClient:
    """LLM client with automatic caching."""

    def __init__(self):
        self.cache = create_memory_cache()
        self.api_calls = 0
        self.cache_hits = 0

    def generate(self, prompt, model="gpt-4", temperature=0.7, **kwargs):
        """Generate response with automatic caching."""
        # Generate cache identifier from all relevant parameters
        key = generate_prompt_key(
            prompt=prompt,
            model=model,
            temperature=temperature,
            **kwargs
        )

        # Check cache
        cached = self.cache.get(key)
        if cached:
            self.cache_hits += 1
            return cached["response"]

        # If miss, call the actual API
        self.api_calls += 1
        response = mock_llm_api_call(prompt, model, temperature, **kwargs)

        # Store the new response in the cache
        self.cache.set(key, response)
        return response["response"]

    def stats(self):
        """Get usage statistics."""
        total = self.api_calls + self.cache_hits
        hit_rate = (self.cache_hits / total * 100) if total > 0 else 0
        return {
            "total_requests": total,
            "api_calls": self.api_calls,
            "cache_hits": self.cache_hits,
            "hit_rate": f"{hit_rate:.1f}%",
        }
# Using the client
client = CachedLLMClient()
print("Using CachedLLMClient:")
client.generate("What is machine learning?")
client.generate("What is machine learning?") # This will be a cache hit
client.generate("Explain neural networks")
client.generate("What is machine learning?") # This will be another cache hit
# Show statistics
stats = client.stats()
print(f"\nClient Statistics:")
print(f" Total requests: {stats['total_requests']}")
print(f" API calls: {stats['api_calls']}")
print(f" Cache hits: {stats['cache_hits']}")
print(f" Hit rate: {stats['hit_rate']}")
This client made four requests but only two actual API calls, achieving a 50% hit rate. In applications with highly repetitive queries, this hit rate can be much higher, leading to dramatic improvements in performance and cost.
A major benefit of caching is cost reduction. You can quantify these savings by storing the cost of each API call along with its response. When a cache hit occurs, you can log the saved amount.
The set method on a cache instance accepts optional metadata for storing extra information that isn't part of the cached value itself. This is a perfect place to store the cost of the original API call.
# In our CachedLLMClient or manual workflow...
total_saved = 0.0  # running total of estimated savings, kept across requests

# When a cache miss occurs:
response = mock_llm_api_call(prompt, model)
# Store the response and its associated cost in the metadata
cache.set(key, response, metadata={"cost": response["cost"]})

# When a cache hit occurs:
cached_entry = cache.get_entry(key)  # get_entry retrieves value and metadata
if cached_entry:
    response = cached_entry.value
    saved_cost = cached_entry.metadata.get("cost", 0.0)
    total_saved += saved_cost
By tracking total_saved, you can directly measure the financial impact of your caching implementation. In a production system, this data is invaluable for monitoring operational expenses and demonstrating the ROI of performance optimizations.
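To make this concrete, here is a minimal sketch that subclasses the earlier CachedLLMClient, records each real call's cost in the entry metadata, and accumulates the savings on every hit. It assumes the same hypothetical mock_llm_api_call helper and uses the get_entry/metadata interface shown above.

class CostTrackingLLMClient(CachedLLMClient):
    """Cached client that also tracks estimated cost savings."""

    def __init__(self):
        super().__init__()
        self.total_saved = 0.0

    def generate(self, prompt, model="gpt-4", temperature=0.7, **kwargs):
        key = generate_prompt_key(
            prompt=prompt, model=model, temperature=temperature, **kwargs
        )

        entry = self.cache.get_entry(key)
        if entry:
            # Cache hit: credit the cost of the original call as savings
            self.cache_hits += 1
            self.total_saved += entry.metadata.get("cost", 0.0)
            return entry.value["response"]

        # Cache miss: pay for the API call and remember what it cost
        self.api_calls += 1
        response = mock_llm_api_call(prompt, model, temperature, **kwargs)
        self.cache.set(key, response, metadata={"cost": response["cost"]})
        return response["response"]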
The kerb.cache module offers several storage backends for different use cases.
MemoryCache: An in-memory cache, created with create_memory_cache(). It's extremely fast but volatile, meaning the cache is cleared when the application restarts. It is ideal for caching within a single long-running process, such as a web server.
DiskCache: A file-system-based cache, created with create_disk_cache(). It persists data across application restarts by writing to disk. While slower than MemoryCache, it's useful for command-line tools or batch jobs that need to reuse results from previous runs.
from kerb.cache import create_disk_cache

# Create a cache that stores data in a local .cache/llm directory
disk_cache = create_disk_cache(cache_dir=".cache/llm", serializer="json")
TieredCache: This backend, created with create_tiered_cache(), combines MemoryCache and DiskCache. It provides the speed of an in-memory cache for frequently accessed items while using a disk cache for persistence and as a larger, slower backup. This is often the best choice for production applications that require both high performance and data persistence.
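As a brief, hypothetical sketch: create_tiered_cache is called here without arguments, but the actual function may accept options such as a memory-tier size or a disk directory, so check its signature in kerb.cache. Once created, a tiered cache is used through the same get/set interface as the other backends.

from kerb.cache import create_tiered_cache, generate_prompt_key

# Assumed zero-argument construction; real options may configure both tiers
tiered_cache = create_tiered_cache()

key = generate_prompt_key("Summarize this article", model="gpt-4", temperature=0.7)
if tiered_cache.get(key) is None:
    # Miss in both tiers: call the API (or a mock) and store the result
    tiered_cache.set(key, {"response": "A short summary...", "cost": 0.002})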