While embedding-based vector search provides a powerful mechanism for retrieving information based on semantic similarity, relying solely on nearest neighbors to a raw query embedding often falls short in complex agentic scenarios. The initial query might be ambiguous, underspecified, or require synthesizing information from multiple distinct perspectives. Naive retrieval can lead to suboptimal context being fed into the LLM, hindering reasoning quality and task success. Advanced retrieval strategies aim to overcome these limitations by refining the query process, diversifying results, or improving relevance scoring.
One significant challenge is bridging the semantic gap between a short, potentially abstract query and the detailed, content-rich documents stored in memory. A query like "What are the latest advancements in agent memory?" might not embed closely to a specific paper detailing a new technique unless that paper happens to use similar phrasing.
Hypothetical Document Embeddings (HyDE) addresses this by using the LLM itself to first generate a hypothetical document (or answer) representing what an ideal response might look like. Instead of embedding the original query $q$, we generate a hypothetical document $d_{hypo} = \text{LLM}(q)$. We then embed this hypothetical document with the same embedding model $f$ used for the corpus, obtaining $e_{hypo} = f(d_{hypo})$, and use $e_{hypo}$ to perform the similarity search against the actual document embeddings in the vector store.
The intuition is that $d_{hypo}$, even if factually incorrect or incomplete, likely contains terms, phrasing, and semantic structures closer to the truly relevant documents than the original sparse query $q$.
High-level workflow for Hypothetical Document Embeddings (HyDE).
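To make the workflow concrete, here is a minimal Python sketch. The `llm_complete`, `embed`, and `vector_store.search` helpers are hypothetical placeholders for whatever LLM client, embedding model, and vector store your application uses.

```python
def hyde_search(query: str, llm_complete, embed, vector_store, top_k: int = 5):
    """Retrieve documents via Hypothetical Document Embeddings (HyDE).

    Assumed helper signatures: llm_complete(prompt) -> str,
    embed(text) -> vector, vector_store.search(vector, top_k) -> documents.
    """
    # 1. Ask the LLM for a hypothetical answer document d_hypo.
    prompt = (
        "Write a short passage that would plausibly answer this question.\n"
        f"Question: {query}\nPassage:"
    )
    d_hypo = llm_complete(prompt)

    # 2. Embed the hypothetical document rather than the original query.
    e_hypo = embed(d_hypo)

    # 3. Search the real corpus with the hypothetical embedding.
    return vector_store.search(e_hypo, top_k=top_k)
```

The prompt wording here is illustrative; in practice you would tune it so the generated passages resemble the style and vocabulary of your corpus.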
Using HyDE involves an additional LLM call upfront, increasing latency and cost. The effectiveness also hinges on the quality of the generated hypothetical document. A poor $d_{hypo}$ might skew the search towards irrelevant results. However, for queries where the user's intent is clear but lacks specific terminology found in the target documents, HyDE can significantly improve retrieval relevance.
Often, a single query vector fails to capture the multifaceted nature of an information need. An agent might need to understand different aspects or perspectives related to a complex topic. Multi-Query Retrieval tackles this by generating several related queries from the original one, performing a search for each, and then combining the results.
The process typically involves:

1. Prompting the LLM to rewrite the original query into several alternative phrasings or focused sub-questions.
2. Embedding each generated query and running a separate similarity search for each one.
3. Merging the result sets, typically by deduplicating documents and taking the unique union (or re-scoring the combined set).
Multi-Query Retrieval workflow generating multiple search vectors.
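A minimal sketch of this workflow follows, assuming the same hypothetical `llm_complete`, `embed`, and `vector_store.search` helpers as before, and documents that expose a stable `id` attribute for deduplication.

```python
def multi_query_search(query: str, llm_complete, embed, vector_store,
                       n_variants: int = 3, top_k: int = 5):
    """Expand one query into several, search each, and merge the results.

    Assumed helper signatures: llm_complete(prompt) -> str,
    embed(text) -> vector, vector_store.search(vector, top_k) -> documents.
    """
    # 1. Generate alternative phrasings covering different angles.
    prompt = (
        f"Generate {n_variants} different rephrasings of this question, "
        f"one per line, each covering a distinct angle.\nQuestion: {query}"
    )
    variants = [query] + llm_complete(prompt).splitlines()[:n_variants]

    # 2. Run one similarity search per variant.
    # 3. Merge results, deduplicating by document id.
    merged, seen = [], set()
    for q in variants:
        for doc in vector_store.search(embed(q), top_k=top_k):
            if doc.id not in seen:
                seen.add(doc.id)
                merged.append(doc)
    return merged
```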
This approach enhances the diversity of retrieved documents, increasing the likelihood of covering different relevant subtopics. The trade-off is the increased number of queries sent to the vector store (potentially N times the cost and latency of a single query) and the added LLM call for query generation. Careful merging is also needed to avoid excessive redundancy in the final context.
Vector similarity search excels at finding semantically related documents quickly from a massive corpus. However, the top-k results based purely on cosine similarity or Euclidean distance of embeddings might not always be the most relevant according to more nuanced criteria. Reranking introduces a second stage to refine the initial candidate set.
The typical flow is:

1. Retrieve a generous candidate set (say, the top 50-100 documents) from the vector store using fast nearest-neighbor search.
2. Score each query-candidate pair with a slower but more accurate reranking model.
3. Keep only the highest-scoring candidates (say, the top 5-10) as the final context.
Reranking models are often cross-encoders. Unlike bi-encoders used for initial retrieval (which embed query and document independently), cross-encoders process the query and a candidate document together, allowing for deeper interaction analysis (e.g., using cross-attention). This yields more accurate relevance scores but is computationally too expensive to apply to the entire corpus.
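As a sketch, the `sentence-transformers` library provides cross-encoder models that can be used this way; the model name below is one publicly available checkpoint, and the candidates are assumed to be plain strings from the first-stage retrieval.

```python
from sentence_transformers import CrossEncoder

# A small cross-encoder trained on the MS MARCO passage-ranking dataset.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Score each (query, candidate) pair jointly and keep the best top_n."""
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)  # one relevance score per pair
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```

Because the cross-encoder runs full attention over every query-document pair, keep the candidate list small (tens of documents, not thousands).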
Example showing how reranking might change the relevance order of initially retrieved documents. Doc B and Doc D move up significantly after reranking.
Reranking adds latency due to the second-stage scoring process. The choice of reranker model (its size, training data, architecture) impacts both the quality improvement and the performance overhead. However, for applications demanding high precision in the retrieved context, reranking is often a valuable addition.
Different retrieval methods have distinct strengths. Keyword search (like BM25) excels at matching specific terms, while vector search captures semantic meaning. Hybrid search combines results from multiple retrieval systems. A common fusion technique is Reciprocal Rank Fusion (RRF), which combines ranked lists from different sources based on rank positions rather than absolute scores, making it robust to score incomparability: each document receives a fused score of $\sum_i \frac{1}{k + r_i(d)}$, where $r_i(d)$ is its rank in list $i$ and the constant $k$ (commonly 60) damps the influence of the very top ranks. By fusing results from, say, vector search and keyword search, agents can benefit from both semantic understanding and term specificity.
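Because RRF depends only on rank positions, it can be implemented in a few lines of plain Python. The sketch below assumes each retrieval system returns an ordered list of document ids, best first.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    Each document's fused score is the sum over lists of 1 / (k + rank),
    where rank is its 1-based position in that list.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a vector-search ranking with a BM25 keyword ranking.
fused = reciprocal_rank_fusion([
    ["doc_a", "doc_b", "doc_c"],   # vector search results
    ["doc_c", "doc_a", "doc_d"],   # BM25 keyword results
])
```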
There's no single "best" advanced retrieval strategy. The optimal choice depends on factors like:

- The nature of typical queries: ambiguous or multifaceted questions favor HyDE or multi-query expansion, while precise, keyword-heavy queries favor hybrid search.
- Latency and cost budgets, since extra LLM calls and second-stage scoring add overhead to every retrieval.
- How much precision the final context window demands, which determines whether reranking is worth its cost.
- The characteristics of the corpus, such as its size, vocabulary, and document length.
Implementing these advanced strategies requires careful consideration of trade-offs between retrieval quality, latency, and computational cost. However, moving beyond basic nearest-neighbor search is frequently necessary to build truly effective and knowledgeable LLM agents capable of complex reasoning and long-horizon tasks. Experimentation and evaluation, as discussed in Chapter 6, are essential for determining the most suitable approach for your specific application.