Caching text embeddings can dramatically improve performance and reduce cost for many applications. In Retrieval-Augmented Generation (RAG), for example, the embedding step is often a major bottleneck: every document chunk must be converted into a vector, and each conversion requires an API call that costs both time and money.
During development, you might run your indexing pipeline dozens of times. In production, documents might be processed repeatedly by different services. Without caching, you would re-compute the same embeddings for the exact same text, needlessly consuming resources.
Embedding caching works by storing the vector representation of a piece of text. When your application needs to embed a text, it first checks if a vector for that exact text already exists in the cache.
The process follows these steps:

1. Generate a cache key from the text and the embedding model's name.
2. Look the key up in the cache. If a stored vector exists, return it immediately (a cache hit).
3. If not (a cache miss), call the embedding model, store the resulting vector under the key, and return it.

Because the key includes the model name, different models (e.g., text-embedding-3-small vs. a local model) will have separate cache entries. This mechanism is particularly effective because embeddings are deterministic; the same text and model will always produce the same vector. The cache module provides specialized tools to handle this workflow.
The LLMCache class provides high-level methods specifically for handling embeddings: cache_embedding and get_cached_embedding. These work in tandem with generate_embedding_key to create consistent, unique keys.
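To see how keys keep models separate, you can call generate_embedding_key directly. The short sketch below is an assumption about its usage (the exact signature is not shown in this section), and local-minilm is just a placeholder model name.

from kerb.cache import generate_embedding_key

text = "Embeddings convert text into vectors."

# Assumption: the key is derived from both the text and the model name,
# so the same text embedded by two different models gets two cache entries.
# "local-minilm" is a placeholder model name for illustration.
key_openai = generate_embedding_key(text, model="text-embedding-3-small")
key_local = generate_embedding_key(text, model="local-minilm")

print(key_openai)
print(key_local)
print("Separate cache entries:", key_openai != key_local)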
Let's start by creating a simple in-memory cache and manually checking for hits and misses. We will simulate an embedding API call to make it clear when the cache is being used.
import time
from typing import List

from kerb.cache import create_llm_cache, generate_embedding_key


# Simulate an expensive embedding API call
def get_embedding_from_api(text: str, model: str) -> List[float]:
    print(f"  -> Calling embedding API for '{text[:20]}...' with model {model}")
    time.sleep(0.5)  # Simulate network latency
    # Return a mock embedding based on text length
    return [len(text) / 100.0] * 5


# 1. Create an LLM-specific cache
llm_cache = create_llm_cache()

# 2. Define the text and model
text_to_embed = "This is the document chunk we need to embed."
model_name = "text-embedding-3-small"

# --- First Request (Cache Miss) ---
print("First request for the text:")
embedding = llm_cache.get_cached_embedding(text=text_to_embed, model=model_name)

if embedding is None:
    print("  ✗ Cache miss. Generating and caching embedding.")
    # Call the API
    new_embedding = get_embedding_from_api(text=text_to_embed, model=model_name)
    # Store in cache
    llm_cache.cache_embedding(text=text_to_embed, embedding=new_embedding, model=model_name)
    embedding = new_embedding
else:
    print("  ✓ Cache hit!")

print(f"  Embedding vector (first 5 values): {embedding[:5]}")

# --- Second Request (Cache Hit) ---
print("\nSecond request for the same text:")
embedding = llm_cache.get_cached_embedding(text=text_to_embed, model=model_name)

if embedding is None:
    print("  ✗ Cache miss. This shouldn't happen!")
else:
    print("  ✓ Cache hit! Returned stored embedding instantly.")
    print(f"  Embedding vector (first 5 values): {embedding[:5]}")
In the first request, the cache is empty, so we see the "Cache miss" message, and our simulated API function is called. The resulting vector is then stored. On the second request for the exact same text and model, the cache returns the stored vector immediately, skipping the expensive API call entirely.
In a real application, you would wrap this logic in a helper function to keep your main code clean. Let's see how this would look in a typical RAG document processing pipeline, where we process a list of document chunks. This is where caching provides the most value, as the same documents are often processed repeatedly during development and testing.
Here, we'll track the cost savings. LLMCache is designed to help with this by tracking hits and misses and estimating time and money saved.
import time
from typing import List

from kerb.cache import create_llm_cache, create_tiered_cache
from kerb.cache.backends import LLMCache


# Simulate an embedding API call with cost
def embed_with_cost(text: str, model: str) -> dict:
    print(f"  -> Calling embedding API for '{text[:20]}...'")
    time.sleep(0.5)
    cost_per_token = 0.00002  # Example cost for ada-002
    num_tokens = len(text.split())
    return {
        "embedding": [num_tokens / 100.0] * 5,
        "cost": num_tokens * cost_per_token,
    }


def get_embedding_with_cache(text: str, model: str, cache: LLMCache) -> List[float]:
    """Gets an embedding, using the cache if possible."""
    cached_embedding = cache.get_cached_embedding(text=text, model=model)
    if cached_embedding is not None:
        print(f"  ✓ Cache hit for '{text[:20]}...'")
        return cached_embedding

    print(f"  ✗ Cache miss for '{text[:20]}...'")
    result = embed_with_cost(text, model)
    cache.cache_embedding(
        text=text,
        embedding=result["embedding"],
        model=model,
        cost=result["cost"],
    )
    return result["embedding"]


# Use a persistent tiered cache for this example
# This saves embeddings to disk across script runs
persistent_cache_backend = create_tiered_cache(disk_cache_dir=".embedding_cache")
llm_cache = create_llm_cache(backend=persistent_cache_backend)

document_chunks = [
    "RAG combines retrieval with generation.",
    "Embeddings convert text into vectors.",
    "Caching reduces latency and cost.",
    "RAG combines retrieval with generation.",  # Duplicate
    "Embeddings convert text into vectors.",  # Duplicate
]

print("Processing document chunks for RAG pipeline (first run):")
for chunk in document_chunks:
    get_embedding_with_cache(text=chunk, model="text-embedding-3-small", cache=llm_cache)

stats_run1 = llm_cache.get_stats()
print("\n--- First Run Stats ---")
print(f"Cache Hits: {stats_run1.hits}")
print(f"Cache Misses: {stats_run1.misses}")
print(f"Estimated Cost Saved: ${stats_run1.estimated_cost_saved:.6f}")

print("\nProcessing same document chunks again (second run):")
for chunk in document_chunks:
    get_embedding_with_cache(text=chunk, model="text-embedding-3-small", cache=llm_cache)

stats_run2 = llm_cache.get_stats()
print("\n--- Cumulative Stats After Second Run ---")
print(f"Cache Hits: {stats_run2.hits}")
print(f"Cache Misses: {stats_run2.misses}")
print(f"Estimated Cost Saved: ${stats_run2.estimated_cost_saved:.6f}")
Notice how the first run results in three cache misses (one per unique chunk) and two hits for the in-run duplicates. The second run results in five cache hits because every chunk, including the duplicates, is already stored. The cost savings accumulate with every subsequent run, and because the tiered backend persists to disk, they carry over even when you restart the script.
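If you want a single number to monitor, you can derive a hit rate from the same stats object. This sketch only uses the hits and misses fields printed above.

# Derive a simple hit rate from the stats printed above
stats = llm_cache.get_stats()
total_lookups = stats.hits + stats.misses
if total_lookups > 0:
    print(f"Hit rate: {stats.hits / total_lookups:.0%}")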
The choice of cache backend determines how and where the embeddings are stored. The cache module offers several options suitable for different scenarios.
create_memory_cache() creates an in-memory cache. It is extremely fast but volatile; the cache is cleared when your application stops. This is ideal for development, testing, or short-lived processes where you want to avoid redundant calls within a single run.

create_disk_cache(cache_dir="...") creates a persistent cache on disk. Embeddings are saved to files in the specified directory, so they persist between application restarts. This is the best choice for RAG indexing pipelines that you run multiple times during development.

create_tiered_cache() creates a two-level cache that combines a fast in-memory cache with a persistent disk cache. It offers the best of both worlds: instant access for frequently used items in memory, with a larger, persistent store on disk. This is the recommended backend for most production applications.

You can easily swap the backend when creating your high-level LLMCache. For example, to set up a persistent tiered cache for your application (minimal sketches of the memory-only and disk-only variants follow this example):
from kerb.cache import create_llm_cache, create_tiered_cache

# Create a tiered cache with a memory size of 200 items
# and a persistent disk cache in the '.cache/embeddings' directory
tiered_backend = create_tiered_cache(
    memory_max_size=200,
    disk_cache_dir=".cache/embeddings",
)

# Wrap it with LLMCache to get embedding-specific methods
llm_cache = create_llm_cache(backend=tiered_backend)

# Now use the cache as before
# embedding = get_embedding_with_cache(text="...", model="...", cache=llm_cache)
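If one of the simpler backends fits your scenario better, the same wrapping pattern applies. This is a minimal sketch using the create_memory_cache() and create_disk_cache() factories described above; only the cache_dir parameter is taken from that description, so adjust the calls to match your setup.

from kerb.cache import create_llm_cache, create_memory_cache, create_disk_cache

# Volatile, in-process cache: fast, but cleared when the application stops.
memory_llm_cache = create_llm_cache(backend=create_memory_cache())

# Persistent, disk-backed cache: survives restarts of an indexing pipeline.
disk_llm_cache = create_llm_cache(backend=create_disk_cache(cache_dir=".cache/embeddings"))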
By implementing embedding caching, you build a more efficient, cost-effective, and faster application. It's a foundational optimization technique that pays significant dividends, especially in systems that process large volumes of text.