Caching text embeddings can dramatically improve performance and reduce cost for many applications. In Retrieval-Augmented Generation (RAG), for example, the embedding step is often a major bottleneck: every document chunk must be converted into a vector, and each conversion requires an API call that costs both time and money.
During development, you might run your indexing pipeline dozens of times. In production, documents might be processed repeatedly by different services. Without caching, you would re-compute the same embeddings for the exact same text, needlessly consuming resources.
Embedding caching works by storing the vector representation of a piece of text. When your application needs to embed a text, it first checks if a vector for that exact text already exists in the cache.
The process follows these steps:

1. Generate a cache key from the text and the embedding model's name.
2. Look the key up in the cache. If a stored vector exists, return it immediately (a cache hit).
3. If not (a cache miss), call the embedding model, store the resulting vector under the key, and return it.

Because the key includes the model name, different models (e.g., text-embedding-3-small vs. a local model) will have separate cache entries. This mechanism is particularly effective because embeddings are deterministic; the same text and model will always produce the same vector. The cache module provides specialized tools to handle this workflow.
The LLMCache class provides high-level methods specifically for handling embeddings: cache_embedding and get_cached_embedding. These work in tandem with generate_embedding_key to create consistent, unique keys.
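To see how keys keep models separate, you can call generate_embedding_key directly. The short sketch below is an assumption about its usage (the exact signature is not shown in this section), and local-minilm is just a placeholder model name.

from kerb.cache import generate_embedding_key

text = "Embeddings convert text into vectors."

# Assumption: the key is derived from both the text and the model name,
# so the same text embedded by two different models gets two cache entries.
# "local-minilm" is a placeholder model name for illustration.
key_openai = generate_embedding_key(text, model="text-embedding-3-small")
key_local = generate_embedding_key(text, model="local-minilm")

print(key_openai)
print(key_local)
print("Separate cache entries:", key_openai != key_local)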
Let's start by creating a simple in-memory cache and manually checking for hits and misses. We will simulate an embedding API call to make it clear when the cache is being used.
import time
from typing import List

from kerb.cache import create_llm_cache, generate_embedding_key


# Simulate an expensive embedding API call
def get_embedding_from_api(text: str, model: str) -> List[float]:
    print(f"  -> Calling embedding API for '{text[:20]}...' with model {model}")
    time.sleep(0.5)  # Simulate network latency
    # Return a mock embedding based on text length
    return [len(text) / 100.0] * 5


# 1. Create an LLM-specific cache
llm_cache = create_llm_cache()

# 2. Define the text and model
text_to_embed = "This is the document chunk we need to embed."
model_name = "text-embedding-3-small"

# --- First Request (Cache Miss) ---
print("First request for the text:")
embedding = llm_cache.get_cached_embedding(text=text_to_embed, model=model_name)

if embedding is None:
    print("  ✗ Cache miss. Generating and caching embedding.")
    # Call the API
    new_embedding = get_embedding_from_api(text=text_to_embed, model=model_name)
    # Store in cache
    llm_cache.cache_embedding(text=text_to_embed, embedding=new_embedding, model=model_name)
    embedding = new_embedding
else:
    print("  ✓ Cache hit!")

print(f"  Embedding vector (first 5 values): {embedding[:5]}")

# --- Second Request (Cache Hit) ---
print("\nSecond request for the same text:")
embedding = llm_cache.get_cached_embedding(text=text_to_embed, model=model_name)

if embedding is None:
    print("  ✗ Cache miss. This shouldn't happen!")
else:
    print("  ✓ Cache hit! Returned stored embedding instantly.")
    print(f"  Embedding vector (first 5 values): {embedding[:5]}")
In the first request, the cache is empty, so we see the "Cache miss" message, and our simulated API function is called. The resulting vector is then stored. On the second request for the exact same text and model, the cache returns the stored vector immediately, skipping the expensive API call entirely.
In a real application, you would wrap this logic in a helper function to keep your main code clean. Let's see how this would look in a typical RAG document processing pipeline, where we process a list of document chunks. This is where caching provides the most value, as the same documents are often processed repeatedly during development and testing.
Here, we'll track the cost savings. LLMCache is designed to help with this by tracking hits and misses and estimating time and money saved.
import time
from typing import List

from kerb.cache import create_llm_cache, create_tiered_cache
from kerb.cache.backends import LLMCache


# Simulate an embedding API call with cost
def embed_with_cost(text: str, model: str) -> dict:
    print(f"  -> Calling embedding API for '{text[:20]}...'")
    time.sleep(0.5)
    cost_per_token = 0.00002  # Example cost for ada-002
    num_tokens = len(text.split())
    return {
        "embedding": [num_tokens / 100.0] * 5,
        "cost": num_tokens * cost_per_token,
    }


def get_embedding_with_cache(text: str, model: str, cache: LLMCache) -> List[float]:
    """Gets an embedding, using the cache if possible."""
    cached_embedding = cache.get_cached_embedding(text=text, model=model)
    if cached_embedding is not None:
        print(f"  ✓ Cache hit for '{text[:20]}...'")
        return cached_embedding

    print(f"  ✗ Cache miss for '{text[:20]}...'")
    result = embed_with_cost(text, model)
    cache.cache_embedding(
        text=text,
        embedding=result["embedding"],
        model=model,
        cost=result["cost"],
    )
    return result["embedding"]


# Use a persistent tiered cache for this example
# This saves embeddings to disk across script runs
persistent_cache_backend = create_tiered_cache(disk_cache_dir=".embedding_cache")
llm_cache = create_llm_cache(backend=persistent_cache_backend)

document_chunks = [
    "RAG combines retrieval with generation.",
    "Embeddings convert text into vectors.",
    "Caching reduces latency and cost.",
    "RAG combines retrieval with generation.",  # Duplicate
    "Embeddings convert text into vectors.",  # Duplicate
]

print("Processing document chunks for RAG pipeline (first run):")
for chunk in document_chunks:
    get_embedding_with_cache(text=chunk, model="text-embedding-3-small", cache=llm_cache)

stats_run1 = llm_cache.get_stats()
print("\n--- First Run Stats ---")
print(f"Cache Hits: {stats_run1.hits}")
print(f"Cache Misses: {stats_run1.misses}")
print(f"Estimated Cost Saved: ${stats_run1.estimated_cost_saved:.6f}")

print("\nProcessing same document chunks again (second run):")
for chunk in document_chunks:
    get_embedding_with_cache(text=chunk, model="text-embedding-3-small", cache=llm_cache)

stats_run2 = llm_cache.get_stats()
print("\n--- Cumulative Stats After Second Run ---")
print(f"Cache Hits: {stats_run2.hits}")
print(f"Cache Misses: {stats_run2.misses}")
print(f"Estimated Cost Saved: ${stats_run2.estimated_cost_saved:.6f}")
Notice how the first run results in three cache misses (one per unique chunk) and two hits for the in-run duplicates. The second run results in five cache hits because every chunk, including the duplicates, is already stored. The cost savings accumulate with every subsequent run, and because the tiered backend persists to disk, they carry over even when you restart the script.
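If you want a single number to monitor, you can derive a hit rate from the same stats object. This sketch only uses the hits and misses fields printed above.

# Derive a simple hit rate from the stats printed above
stats = llm_cache.get_stats()
total_lookups = stats.hits + stats.misses
if total_lookups > 0:
    print(f"Hit rate: {stats.hits / total_lookups:.0%}")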
The choice of cache backend determines how and where the embeddings are stored. The cache module offers several options suitable for different scenarios.
create_memory_cache() creates an in-memory cache. It is extremely fast but volatile; the cache is cleared when your application stops. This is ideal for development, testing, or short-lived processes where you want to avoid redundant calls within a single run.

create_disk_cache(cache_dir="...") creates a persistent cache on disk. Embeddings are saved to files in the specified directory, so they persist between application restarts. This is the best choice for RAG indexing pipelines that you run multiple times during development.

create_tiered_cache() creates a two-level cache that combines a fast in-memory cache with a persistent disk cache. It offers the best of both worlds: instant access for frequently used items in memory, with a larger, persistent store on disk. This is the recommended backend for most production applications.

You can easily swap the backend when creating your high-level LLMCache. For example, to set up a persistent tiered cache for your application (minimal sketches of the memory-only and disk-only variants follow this example):
from kerb.cache import create_llm_cache, create_tiered_cache

# Create a tiered cache with a memory size of 200 items
# and a persistent disk cache in the '.cache/embeddings' directory
tiered_backend = create_tiered_cache(
    memory_max_size=200,
    disk_cache_dir=".cache/embeddings",
)

# Wrap it with LLMCache to get embedding-specific methods
llm_cache = create_llm_cache(backend=tiered_backend)

# Now use the cache as before
# embedding = get_embedding_with_cache(text="...", model="...", cache=llm_cache)
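If one of the simpler backends fits your scenario better, the same wrapping pattern applies. This is a minimal sketch using the create_memory_cache() and create_disk_cache() factories described above; only the cache_dir parameter is taken from that description, so adjust the calls to match your setup.

from kerb.cache import create_llm_cache, create_memory_cache, create_disk_cache

# Volatile, in-process cache: fast, but cleared when the application stops.
memory_llm_cache = create_llm_cache(backend=create_memory_cache())

# Persistent, disk-backed cache: survives restarts of an indexing pipeline.
disk_llm_cache = create_llm_cache(backend=create_disk_cache(cache_dir=".cache/embeddings"))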
By implementing embedding caching, you build a more efficient, cost-effective, and faster application. It's a foundational optimization technique that pays significant dividends, especially in systems that process large volumes of text.