Theory provides the foundation, but practical application solidifies understanding. Let's put the optimization concepts discussed in this chapter into practice by tuning a sample LangChain chain. We'll identify performance issues, apply specific techniques, and measure the impact.
Imagine a chain designed to answer questions based on a collection of technical reports. The process involves:

- Retrieving relevant report excerpts with a vector store retriever.
- Generating an initial answer with a moderately sized LLM (`llm_initial`).
- Refining that answer with a larger, more capable LLM (`llm_refine`).
This multi-step process is common but can introduce latency and increase costs due to multiple LLM interactions and data retrieval operations.
Let's assume our initial chain implementation looks something like this conceptually:
# Assume retriever, llm_initial, llm_refine are pre-configured
# retriever: A vector store retriever
# llm_initial: A moderately sized LLM for quick answer generation
# llm_refine: A larger, more capable LLM for refinement
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser
# Simplified RAG setup
retrieve_docs = RunnablePassthrough.assign(
    context=(lambda x: x["question"]) | retriever
)
# Initial answer prompt and chain
initial_prompt_template = ChatPromptTemplate.from_template(
    "Based on this context:\n{context}\n\nAnswer the question: {question}"
)
initial_answer_chain = initial_prompt_template | llm_initial | StrOutputParser()
# Refinement prompt and chain
refine_prompt_template = ChatPromptTemplate.from_template(
"Refine this initial answer: '{initial_answer}' based on the original question: '{question}'. Ensure coherence and accuracy."
)
refine_chain = refine_prompt_template | llm_refine | StrOutputParser()
# Full chain combining steps
full_chain = retrieve_docs | RunnablePassthrough.assign(
    initial_answer=initial_answer_chain
) | RunnablePassthrough.assign(
    final_answer=(lambda x: {"initial_answer": x["initial_answer"], "question": x["question"]})
    | refine_chain
)
# Example Invocation (Conceptual)
# result = full_chain.invoke({"question": "What are the scaling limits of System X?"})
# print(result["final_answer"])
Before optimizing, we need a baseline. We can use simple timing or integrate with a tracing tool like LangSmith (covered in Chapter 5). For simplicity, let's use basic timing. We run the chain with a sample question multiple times and average the results.
import random
import statistics
import time
question = "Summarize the key findings regarding performance degradation under load."
num_runs = 5
latencies = []
# Assume token tracking is implemented separately or via LangSmith
for _ in range(num_runs):
    start_time = time.time()
    # result = full_chain.invoke({"question": question})  # Execute the chain
    # Simulate execution time for demonstration (centered on the ~13.5 s breakdown below)
    time.sleep(13.5 + (random.random() * 6 - 3))  # Simulate roughly 10.5-16.5 sec latency
    end_time = time.time()
    latencies.append(end_time - start_time)
average_latency = statistics.mean(latencies)
print(f"Average latency (baseline): {average_latency:.2f} seconds")
# Let's assume baseline token count observed via logs/LangSmith: ~2100 tokens per query
Let's say our baseline measurement yields an average latency of roughly 13.5 seconds and roughly 2,100 tokens per query. Using LangSmith or detailed logging, we might find the breakdown:

- Retrieval: 1.5 seconds
- Initial answer (`llm_initial`): 4.0 seconds (800 tokens)
- Refinement (`llm_refine`): 8.0 seconds (1,300 tokens)

The refinement step (`llm_refine`) is the most significant bottleneck in both latency and token consumption.
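This kind of breakdown is easiest to read off a LangSmith trace, but you can approximate the latency portion by timing each sub-chain on its own. A minimal sketch, reusing the components defined above (the timed helper is just for illustration):

import time

def timed(label, runnable, payload):
    # Time a single sub-chain invocation and pass its output along
    start = time.time()
    output = runnable.invoke(payload)
    print(f"{label}: {time.time() - start:.2f}s")
    return output

sample = {"question": "Summarize the key findings regarding performance degradation under load."}
with_context = timed("Retrieval", retrieve_docs, sample)  # adds "context" to the input dict
initial = timed("Initial answer", initial_answer_chain, with_context)
final = timed("Refinement", refine_chain, {"initial_answer": initial, "question": sample["question"]})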
Let's apply some techniques discussed earlier.
Technique 1: Caching LLM Responses
Identical questions or intermediate processing steps might occur frequently. Caching LLM responses can dramatically reduce latency and cost for repeat requests. Let's add an in-memory cache. For production, you'd typically use a more persistent cache like Redis, SQL, or specialized vector caching.
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache
# Set up a simple in-memory cache
set_llm_cache(InMemoryCache())
# No changes needed to the chain definition itself if LLMs are configured globally
# Or, apply cache directly when initializing LLMs:
# llm_initial = ChatOpenAI(..., cache=InMemoryCache())
# llm_refine = ChatOpenAI(..., cache=InMemoryCache())
# Re-run the timing test, ensuring to run the *same* question multiple times
# The first run will be slow, subsequent identical runs should be much faster.
After adding caching and running the same query again, the second and subsequent runs return almost immediately: both LLM calls are served from the cache, so latency drops to roughly the retrieval time (under two seconds in this setup) and no new LLM tokens are consumed.
Caching is highly effective for repeated inputs but doesn't help with novel queries.
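An in-memory cache also disappears when the process exits. For a cache that survives restarts, a persistent backend is a near drop-in replacement; a minimal sketch using the SQLite-backed cache from langchain_community (the file name is arbitrary):

from langchain_community.cache import SQLiteCache
from langchain_core.globals import set_llm_cache

# Cached LLM responses are written to a local SQLite file and reused across restarts
set_llm_cache(SQLiteCache(database_path=".langchain_cache.db"))

A Redis-backed cache works the same way when several application instances need to share cached responses.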
Technique 2: Optimizing the Refinement Step
The refinement LLM call is our primary bottleneck for novel queries.
- Prompt Engineering: Can we make the refinement prompt more concise? Perhaps the initial prompt can request a more structured output that requires less refinement. Let's assume we tighten the `refine_prompt_template` so it is slightly shorter, saving maybe 50 tokens per call on average.
- Model Selection: Is the powerful `llm_refine` strictly necessary? Could a slightly smaller, faster model achieve acceptable quality? Let's hypothetically switch `llm_refine` to a model known to be roughly 30% faster and to use roughly 30% fewer tokens on similar tasks, perhaps accepting a minor quality trade-off.
- Conditional Execution: Maybe refinement isn't always needed. We could add a step before refinement that uses a simpler model or a rule-based check to decide whether the initial answer is already good enough; if it is, we skip the refinement call entirely (see the sketch after this list).
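To make the conditional-execution idea concrete, here is a minimal sketch using RunnableBranch. The needs_refinement heuristic (answer length plus a hedging phrase) is purely illustrative; a real check might use a small classifier model instead:

from langchain_core.runnables import RunnableBranch, RunnableLambda

def needs_refinement(x: dict) -> bool:
    # Hypothetical rule-based check: refine only short or hedged initial answers
    answer = x["initial_answer"]
    return len(answer) < 200 or "not sure" in answer.lower()

# Refine only when the heuristic says so; otherwise return the initial answer as-is
conditional_refine = RunnableBranch(
    (needs_refinement, refine_chain),
    RunnableLambda(lambda x: x["initial_answer"]),
)

full_chain_conditional = retrieve_docs | RunnablePassthrough.assign(
    initial_answer=initial_answer_chain
) | RunnablePassthrough.assign(
    final_answer=(lambda x: {"initial_answer": x["initial_answer"], "question": x["question"]})
    | conditional_refine
)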
Let's simulate the effect of switching `llm_refine` to a faster model and refining the prompt slightly.
# Assume llm_refine_faster is configured (a faster, slightly less powerful model)
# Assume refine_prompt_template_optimized is slightly shorter
# Update the refine_chain part
refine_chain_optimized = refine_prompt_template_optimized | llm_refine_faster | StrOutputParser()
# Update the full chain definition to use the optimized refine chain
full_chain_optimized = retrieve_docs | RunnablePassthrough.assign(
    initial_answer=initial_answer_chain
) | RunnablePassthrough.assign(
    final_answer=(lambda x: {"initial_answer": x["initial_answer"], "question": x["question"]})
    | refine_chain_optimized
)
# Re-run the timing test for novel queries (cache won't help here initially)
# ... timing code ...
Let's measure the performance of `full_chain_optimized` for novel queries (cache miss scenario):
# Simulate execution time for demonstration after optimization
# Retrieval: 1.5s (no change)
# Initial LLM: 4.0s (no change, 800 tokens)
# Refine LLM (Faster Model + Prompt): 8.0s * 0.7 ≈ 5.6s (1300 tokens * 0.7 - 50 ≈ 860 tokens)
# Total Latency ≈ 1.5 + 4.0 + 5.6 = 11.1s
# Total Tokens ≈ 800 + 860 = 1660 tokens
# --- Python code to simulate and measure ---
latencies_optimized = []
for _ in range(num_runs):
    start_time = time.time()
    # result = full_chain_optimized.invoke({"question": question})  # Execute the optimized chain
    # Simulate optimized execution time (centered on the ~11.1 s estimate above)
    time.sleep(11.1 + (random.random() * 3 - 1.5))  # Simulate roughly 9.6-12.6 sec latency
    end_time = time.time()
    latencies_optimized.append(end_time - start_time)
average_latency_optimized = statistics.mean(latencies_optimized)
print(f"Average latency (optimized): {average_latency_optimized:.2f} seconds")
# Estimated optimized token count: ~1660 tokens
Our new measurements for novel queries might look like this:

- Average latency: ~11.1 seconds (down from ~13.5 seconds)
- Total tokens per query: ~1,660 (down from ~2,100)
Let's visualize the improvement:
Comparison of average latency and token usage before and after applying caching and model optimization techniques. Note the dramatic improvement for cached queries.
Reducing token count directly impacts cost. If the combined cost of `llm_initial` and `llm_refine` was $0.002 per 1K tokens:

- Baseline: ~2,100 tokens per query ≈ $0.0042 per query
- Optimized: ~1,660 tokens per query ≈ $0.0033 per query

That works out to roughly a 21% reduction in LLM cost for each novel query, in addition to the near-zero marginal cost of cached queries.
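Spelled out in code, with the $0.002-per-1K figure treated as an assumed blended rate across both models:

price_per_1k_tokens = 0.002  # assumed blended price per 1K tokens

baseline_cost = 2100 / 1000 * price_per_1k_tokens   # ≈ $0.0042 per query
optimized_cost = 1660 / 1000 * price_per_1k_tokens  # ≈ $0.0033 per query
savings = (baseline_cost - optimized_cost) / baseline_cost
print(f"Cost per novel query: ${baseline_cost:.4f} -> ${optimized_cost:.4f} ({savings:.0%} lower)")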
We achieved significant improvements:

- Repeated queries are now served from the cache almost instantly, with no additional LLM token spend.
- Novel queries dropped from roughly 13.5 to roughly 11.1 seconds (about 18% faster) and from roughly 2,100 to roughly 1,660 tokens (about 21% cheaper), at the cost of a possible minor quality trade-off in the refinement step.
This practice exercise demonstrates a typical tuning workflow:

1. Establish a baseline for latency and token usage.
2. Break the chain into its steps and identify the dominant bottleneck.
3. Apply targeted techniques such as caching, prompt tightening, model selection, or conditional execution.
4. Re-measure against the baseline and check for quality regressions.
Remember to leverage tools like LangSmith for detailed tracing and analysis, which simplifies the identification and measurement phases considerably in complex applications.
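If you have a LangSmith account, enabling tracing for this exercise typically only requires a few environment variables; the project name below is just an example:

import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your LangSmith API key>"
os.environ["LANGCHAIN_PROJECT"] = "chain-tuning-practice"  # example project name

# Once set, subsequent chain invocations are traced automatically,
# giving per-step latency and token breakdowns without any code changes.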