When building LLM applications that are used frequently, you will face two practical challenges: latency and operational cost. Each call to a language model or an embedding service takes time to process and incurs a fee, which is often calculated based on the number of tokens. The total cost for a single API call can be represented as:
$$C = P_{\text{token}} \times N$$

where $P_{\text{token}}$ is the price per token and $N$ is the combined number of tokens in the prompt and completion. For applications that handle repetitive queries, these costs can accumulate quickly.
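To make the formula concrete, here is a small helper that estimates per-call cost from token counts. The price used is an assumed illustrative figure, not an actual provider rate.

```python
# Estimate the cost of a single API call from its token counts.
# PRICE_PER_TOKEN is an assumed illustrative rate, not a real provider price.
PRICE_PER_TOKEN = 2.00 / 1_000_000  # $2.00 per million tokens (assumption)

def call_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Apply C = P_token * N to one prompt/completion pair."""
    return PRICE_PER_TOKEN * (prompt_tokens + completion_tokens)

# A 400-token prompt that produces a 150-token completion:
print(f"${call_cost(400, 150):.6f}")  # 550 tokens * $0.000002 = $0.001100
```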
This chapter introduces caching as a primary technique for addressing these issues. We will start by identifying common performance bottlenecks in an LLM-driven system. You will then learn to implement two specific caching strategies using the cache module, previewed in the sketch after this list:

- Caching complete LLM responses, so that repeated prompts are answered from a local store instead of a new model call.
- Caching embeddings, so that texts you have already embedded do not trigger additional requests to the embedding service.
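Both strategies rest on the same idea: store the result of an expensive call, keyed by its input, and reuse it when the same input appears again. The snippet below is a minimal sketch of that idea using a plain in-memory dictionary and a placeholder `query_llm` function; the sections that follow replace this hand-rolled dictionary with the cache module.

```python
# A minimal sketch of exact-match response caching with an in-memory dict.
# `query_llm` is a placeholder standing in for a real (slow, billed) model call.

def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    return f"Model answer for: {prompt}"

_response_cache: dict[str, str] = {}

def cached_completion(prompt: str) -> str:
    """Serve repeated prompts from memory; call the model only on a cache miss."""
    if prompt not in _response_cache:            # miss: pay latency and token cost once
        _response_cache[prompt] = query_llm(prompt)
    return _response_cache[prompt]               # hit: no API call at all

print(cached_completion("What is caching?"))  # first call reaches the model
print(cached_completion("What is caching?"))  # second call is answered from memory
```

The same pattern applies to embeddings: the text is the key and its vector is the stored value.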
Finally, we will cover the practical matter of cache invalidation to ensure your application's data remains current. By working through these sections, you will learn how to make your applications faster and more cost-effective.
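As a preview of one common invalidation approach, the sketch below attaches a time-to-live (TTL) to each cached entry. The TTL value and helper names are illustrative assumptions, not part of any particular library.

```python
import time

# Time-based invalidation: every entry records when it was written, and entries
# older than TTL_SECONDS are refetched on the next lookup. The 1-hour TTL is an
# arbitrary value chosen for illustration.
TTL_SECONDS = 3600
_cache: dict[str, tuple[float, str]] = {}       # prompt -> (written_at, response)

def get_fresh(prompt: str, fetch) -> str:
    """Return the cached value unless it has expired; otherwise refetch it."""
    entry = _cache.get(prompt)
    if entry is not None and time.time() - entry[0] < TTL_SECONDS:
        return entry[1]                          # still fresh: serve from cache
    value = fetch(prompt)                        # stale or missing: call the API again
    _cache[prompt] = (time.time(), value)
    return value
```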
9.1 Identifying Performance Bottlenecks
9.2 Implementing LLM Response Caching
9.3 Caching Embeddings to Reduce API Calls
9.4 Cache Invalidation Strategies