Now that we have our documents loaded, split into manageable chunks, embedded into vectors, and indexed within a vector store, the next step is to retrieve the relevant information when a user poses a query. This is the "Retrieval" part of Retrieval Augmented Generation (RAG), and it relies on the concept of semantic search.
Unlike keyword search, which matches exact words, semantic search finds text that is similar in meaning, even when the wording differs. This is possible because our text embedding models have captured the meaning of the document chunks as vectors in a high-dimensional space. Chunks with similar meanings are located closer together in this vector space.
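As a quick illustration, here is a minimal sketch using the sentence-transformers library (the model name is just one common choice, not a requirement): two sentences that share meaning but not wording score far higher than an unrelated pair.

```python
from sentence_transformers import SentenceTransformer, util

# Model choice is illustrative; any sentence embedding model behaves similarly
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The quarterly report showed strong revenue growth.",  # A
    "Earnings increased significantly last quarter.",      # B: same meaning, different words
    "The office coffee machine is broken again.",          # C: unrelated
]
embeddings = model.encode(sentences)

print(util.cos_sim(embeddings[0], embeddings[1]))  # high: similar meaning
print(util.cos_sim(embeddings[0], embeddings[2]))  # low: different topic
```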
When a user submits a query (e.g., "What were the main findings of the Q3 report?"), the first step in the retrieval process is to convert this query into an embedding vector using the same text embedding model that was used to embed the document chunks. This is important; comparing vectors generated by different models is generally meaningless.
Once we have the query vector, we use the vector store to find the document chunk vectors that are "closest" to the query vector in the embedding space. Closeness is typically measured using a similarity metric.
Several metrics can quantify the similarity between two vectors. The most common one used in semantic search is Cosine Similarity.
Cosine Similarity: Measures the cosine of the angle between two vectors. It ranges from -1 (exactly opposite) to 1 (exactly the same), with 0 indicating orthogonality (no similarity). For text embeddings, which are often normalized, cosine similarity effectively measures orientation similarity, irrespective of vector magnitude. A higher cosine similarity score means the vectors point in more similar directions, indicating greater semantic relevance. The formula is:
$$\text{Cosine Similarity}(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|}$$
Where $A$ and $B$ are the vectors, $A \cdot B$ is the dot product, and $\|A\|$ and $\|B\|$ are their magnitudes.
Euclidean Distance (L2 Distance): Measures the straight-line distance between the endpoints of two vectors. Lower distance means higher similarity.
$$\text{Euclidean Distance}(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$$
Dot Product: Sometimes used directly, especially with normalized vectors, where it becomes equivalent to cosine similarity.
Most vector stores default to using cosine similarity for semantic search tasks, as it generally performs well for high-dimensional text embeddings.
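To make these definitions concrete, here is a small NumPy sketch computing all three metrics for one pair of vectors (the values are arbitrary, chosen only for illustration):

```python
import numpy as np

# Two illustrative embedding vectors (real embeddings have hundreds of dimensions)
A = np.array([0.2, 0.8, 0.4])
B = np.array([0.3, 0.7, 0.5])

# Cosine similarity: dot product divided by the product of the magnitudes
cosine_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Euclidean (L2) distance: straight-line distance between the endpoints
euclidean_dist = np.linalg.norm(A - B)

# Dot product: equals cosine similarity when both vectors are unit-length
dot = np.dot(A, B)

print(f"Cosine similarity: {cosine_sim:.4f}")
print(f"Euclidean distance: {euclidean_dist:.4f}")
print(f"Dot product: {dot:.4f}")
```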
Vector stores provide efficient methods to perform these similarity searches, often referred to as k-Nearest Neighbors (k-NN) searches. Given a query vector, the vector store rapidly identifies the k vectors in its index that have the highest similarity score (or lowest distance) to the query vector.
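Under the hood, the simplest (brute-force) form of this search just scores the query against every indexed vector and keeps the top k. The NumPy sketch below uses made-up vectors to show the mechanics; production vector stores replace the exhaustive scan with approximate indexes (HNSW, IVF, and similar) to stay fast at scale:

```python
import numpy as np

# Illustrative index: 5 document embeddings of dimension 3 (one row per chunk)
doc_vectors = np.array([
    [0.1, 0.9, 0.2],
    [0.8, 0.1, 0.3],
    [0.2, 0.8, 0.4],
    [0.9, 0.2, 0.1],
    [0.3, 0.7, 0.5],
])
query_vector = np.array([0.2, 0.85, 0.3])

# Cosine similarity of the query against every document vector
norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
similarities = doc_vectors @ query_vector / norms

# Indices of the k highest-scoring documents, best match first
k = 3
top_k = np.argsort(similarities)[::-1][:k]
print(top_k, similarities[top_k])
```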
Let's illustrate with a conceptual Python example. Assume we have a vector_store object (representing our connection to FAISS, ChromaDB, Pinecone, etc.) and an embedding_function capable of converting text to vectors:
```python
# Assume vector_store and embedding_function are already initialized
user_query = "What safety protocols were updated last month?"

# 1. Embed the user query
query_embedding = embedding_function.embed_query(user_query)

# 2. Perform similarity search in the vector store
# We typically ask for the top 'k' most similar documents. Let's ask for 4.
# (Some libraries accept the raw query text and embed it internally;
# here we pass our precomputed embedding.)
k = 4
search_results = vector_store.similarity_search_with_score(
    query=query_embedding,
    k=k
)

# 3. Process the results
retrieved_docs = []
for doc, score in search_results:
    print(f"Score: {score:.4f}")
    print(f"Content: {doc.page_content[:200]}...")  # Show snippet
    print("-" * 20)
    retrieved_docs.append(doc)  # 'doc' usually contains the text chunk and metadata

# 'retrieved_docs' now holds the text of the k most relevant chunks
```
In this example, we first embed the user_query to get query_embedding, then call .similarity_search_with_score on our vector_store, passing the query_embedding and specifying k=4, meaning we want the 4 most similar document chunks. Vector store libraries might have slightly different method names (search, query, get_nearest_neighbors), but the principle is the same.
[Figure: Conceptual visualization of embeddings in 2D space. The search aims to find document points (blue) closest to the query point (red). Here, Doc 1, Doc 2, and Doc 4 would likely be retrieved if k=3.]
The parameter k, the number of documents to retrieve, is an important setting.
The optimal value for k often depends on the specific application, the length of the document chunks, and the nature of the expected queries, and it usually requires some experimentation. Starting with a value like 3 or 5 is common, followed by evaluation and adjustment. Some systems also apply a relevance score threshold, including only documents above a certain similarity score rather than a fixed number k.
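A minimal sketch of that thresholding approach, assuming (as in the example above) the store returns (document, score) pairs where a higher score means greater similarity; some stores return distances where lower is better, so check your store's convention, and the 0.75 cutoff here is purely illustrative:

```python
SCORE_THRESHOLD = 0.75  # illustrative cutoff; tune against your own evaluation data

# Keep only results whose similarity score clears the threshold
retrieved_docs = [doc for doc, score in search_results if score >= SCORE_THRESHOLD]
```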
After this retrieval step, we have a set of document chunks deemed most relevant to the user's query based on semantic similarity. The next section explains how to combine this retrieved context with the original query into a new prompt for the LLM.