To make an application faster or more cost-effective, you must first understand where it spends its time and resources. Traditional software often experiences bottlenecks in database queries, complex calculations, or I/O operations. LLM applications share some of these, but they introduce two sources of latency and cost that often overshadow all others: generation calls and embedding calls to external model APIs.
A typical application, especially one using Retrieval-Augmented Generation (RAG), follows a multi-stage process. Some stages run locally and are generally fast, while others involve network requests to third-party services, which introduce significant delays and costs.
A typical RAG application workflow. The most significant bottlenecks usually occur during the embedding and generation stages, which rely on external API calls.
Let's break down where performance issues commonly arise.
The most apparent bottleneck is the final generation call to the LLM. When your application sends a prompt and waits for a response, several factors contribute to latency, including the network round trip, the time the model needs to process the prompt, and the number of output tokens it must generate.
Cost is directly tied to usage. As mentioned in the introduction, the cost of each call is a function of both the input (prompt) and output (completion) tokens: total cost = (input tokens × price per input token) + (output tokens × price per output token).
For an application that receives many identical or similar queries, these costs add up. For example, a customer support bot that repeatedly answers "What are your business hours?" makes a new, costly API call every single time.
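To make this concrete, the rough estimate below multiplies token counts by per-token prices and scales by call volume. The prices and call counts are placeholder assumptions for illustration, not real provider rates.

# Rough cost estimate for a repeated query.
# The prices below are placeholder assumptions, not real provider rates.
PRICE_PER_INPUT_TOKEN = 0.50 / 1_000_000   # assumed $0.50 per 1M input tokens
PRICE_PER_OUTPUT_TOKEN = 1.50 / 1_000_000  # assumed $1.50 per 1M output tokens

def estimate_call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single API call: input and output tokens are priced separately."""
    return (input_tokens * PRICE_PER_INPUT_TOKEN
            + output_tokens * PRICE_PER_OUTPUT_TOKEN)

# A support bot answering "What are your business hours?" 10,000 times,
# assuming roughly 200 prompt tokens and 80 completion tokens per call.
cost_per_call = estimate_call_cost(input_tokens=200, output_tokens=80)
print(f"Cost per call:         ${cost_per_call:.6f}")
print(f"Cost for 10,000 calls: ${cost_per_call * 10_000:.2f}")

Even at these small per-call amounts, identical questions answered thousands of times accumulate real cost that caching could eliminate entirely.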
The second major bottleneck is embedding generation. In a RAG system, every document chunk must be converted into a vector embedding before it can be indexed in a vector database. While this is often a one-time "ingestion" cost, it can be substantial. If you have 10,000 document chunks, you must make thousands of API calls to an embedding service. This process can be both time-consuming and expensive.
Furthermore, if your application processes new documents frequently or generates embeddings for user queries in real time, these calls contribute to ongoing operational costs and latency. Repetitive calls to embed the same text, such as common search terms or document headers, are an inefficient use of resources.
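One quick way to gauge this redundancy is to count how many texts in an embedding workload are exact duplicates before any API call is made. The snippet below is a minimal sketch; texts_to_embed is a hypothetical workload, not data from a real system.

# Minimal sketch: measure how many embedding calls would be redundant.
# `texts_to_embed` is a hypothetical mix of document chunks and user queries.
texts_to_embed = [
    "What are your business hours?",
    "Chapter 1: Introduction",
    "What are your business hours?",   # exact duplicate
    "Chapter 1: Introduction",         # exact duplicate
    "How do I reset my password?",
]

unique_texts = set(texts_to_embed)
redundant_calls = len(texts_to_embed) - len(unique_texts)

print(f"Total embedding requests: {len(texts_to_embed)}")
print(f"Unique texts:             {len(unique_texts)}")
print(f"Redundant API calls:      {redundant_calls}")

Every redundant call found this way is latency and spend that a cache can avoid.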
The first step in optimization is measurement. A simple way to identify bottlenecks is to time each major stage of your application's workflow.
Consider a simplified function that simulates a RAG query process. By wrapping each step with timing logic, you can pinpoint where the most time is spent.
import time

def mock_llm_api_call(prompt):
    """Simulate a slow LLM API call."""
    time.sleep(2.5)  # Simulate 2.5 seconds of latency
    return f"This is a generated response to: {prompt[:50]}..."

def mock_embedding_api_call(text):
    """Simulate a faster but still significant embedding API call."""
    time.sleep(0.1)  # Simulate 100ms of latency
    return [0.1] * 384  # Return a dummy vector

def run_rag_query(query: str):
    """Simulate a full RAG query and time each step."""
    print(f"\nProcessing query: '{query}'")

    # Step 1: Generate query embedding
    start_time = time.time()
    query_embedding = mock_embedding_api_call(query)
    embed_duration = time.time() - start_time
    print(f"  1. Embedding generation: {embed_duration:.4f}s")

    # Step 2: Retrieve documents (simulated)
    start_time = time.time()
    time.sleep(0.05)  # Simulate local vector search
    retrieved_context = "Some relevant context is retrieved here."
    retrieve_duration = time.time() - start_time
    print(f"  2. Document retrieval: {retrieve_duration:.4f}s")

    # Step 3: Call LLM for final generation
    start_time = time.time()
    prompt = f"Context: {retrieved_context}\n\nQuestion: {query}"
    final_response = mock_llm_api_call(prompt)
    generate_duration = time.time() - start_time
    print(f"  3. LLM generation: {generate_duration:.4f}s")

    total_duration = embed_duration + retrieve_duration + generate_duration
    print(f"  -------------------------------------")
    print(f"  Total time: {total_duration:.4f}s")

# Run the simulation
run_rag_query("What is the Kerb toolkit?")
Running this code produces output similar to this:
Processing query: 'What is the Kerb toolkit?'
1. Embedding generation: 0.1002s
2. Document retrieval: 0.0501s
3. LLM generation: 2.5003s
-------------------------------------
Total time: 2.6506s
The results are clear: the LLM generation call accounted for over 94% of the total request time. The embedding call, while much faster, still took twice as long as the local retrieval step. This simple analysis immediately shows that optimizing the API calls will yield the largest performance gains.
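In a real application, rather than repeating the time.time() bookkeeping around every stage, you can factor the measurement into a small reusable helper. The context manager below is a minimal sketch of that approach, not part of any particular library; the usage lines assume the mock functions defined earlier are in scope.

import time
from contextlib import contextmanager

@contextmanager
def timed(stage_name: str):
    """Print how long the wrapped block of code takes to run."""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration = time.perf_counter() - start
        print(f"  {stage_name}: {duration:.4f}s")

# Usage: wrap each stage of the pipeline with the same helper.
with timed("Embedding generation"):
    query_embedding = mock_embedding_api_call("What is the Kerb toolkit?")

with timed("LLM generation"):
    response = mock_llm_api_call("What is the Kerb toolkit?")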
With these bottlenecks identified, we can now explore solutions. The following sections will show you how to implement caching strategies with the cache module to dramatically reduce latency and cost by avoiding redundant API calls for both LLM responses and embeddings.
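As a preview of the idea, the sketch below shows the simplest possible form of response caching: a plain in-memory dictionary keyed by the prompt. It only illustrates the concept and reuses mock_llm_api_call from the timing example; the following sections cover the cache module's more complete approach.

# Minimal sketch of response caching with a plain dictionary.
response_cache = {}

def cached_llm_call(prompt: str) -> str:
    """Return a cached response if available, otherwise call the (mock) API."""
    if prompt in response_cache:
        return response_cache[prompt]       # cache hit: no API latency or cost
    response = mock_llm_api_call(prompt)    # cache miss: pay the full 2.5s
    response_cache[prompt] = response
    return response

cached_llm_call("What are your business hours?")  # slow: first call hits the API
cached_llm_call("What are your business hours?")  # fast: served from the cache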