Okay, we've established that our text documents, or more precisely, chunks of them, can be transformed into numerical vectors using embedding models. Each vector lives in a high-dimensional space, where its position and direction encode the semantic meaning of the original text. Now, how do we use these vectors to find information relevant to a user's query? The answer lies in similarity search.
The core idea is straightforward: if two vectors are "close" to each other in this embedding space, the text chunks they represent are likely semantically similar. When a user submits a query (e.g., "What are the benefits of RAG?"), we first convert this query into a vector using the same embedding model that processed our documents. Then, the task is to find the document chunk vectors in our collection that are closest or most similar to this query vector.
How do we measure "closeness" or "similarity" between high-dimensional vectors? While you might initially think of Euclidean distance (the straight-line distance between two points), it's often not the best measure for semantic similarity derived from text embeddings. Why? Because the magnitude (length) of an embedding vector doesn't always correlate strongly with relevance. Two text snippets could have similar meanings but different lengths, potentially resulting in vectors with different magnitudes.
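To see concretely why magnitude can mislead, here is a small sketch (hypothetical vectors, assuming NumPy) where two embeddings point in the same direction but differ in length, as might happen for a short and a long snippet with the same meaning:

```python
import numpy as np

# Two hypothetical embeddings pointing the same way but with different lengths.
short_text_vec = np.array([0.2, 0.4, 0.1])
long_text_vec = np.array([0.6, 1.2, 0.3])  # same direction, 3x the magnitude

# Euclidean distance is large because the magnitudes differ...
euclidean = np.linalg.norm(short_text_vec - long_text_vec)

# ...but cosine similarity is exactly 1.0 because the orientation is identical.
cosine = (short_text_vec @ long_text_vec) / (
    np.linalg.norm(short_text_vec) * np.linalg.norm(long_text_vec)
)

print(f"Euclidean distance: {euclidean:.3f}")
print(f"Cosine similarity:  {cosine:.3f}")
```

Euclidean distance would rank these two vectors as "far apart" even though they encode the same orientation, which is the problem the next metric avoids.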
Instead, the most common metric used in RAG systems for comparing text embeddings is Cosine Similarity. This metric doesn't measure the distance between vector endpoints; it measures the cosine of the angle between the two vectors.
Think of two vectors starting from the origin (0,0,...). If they point in the exact same direction, the angle between them is 0°, and the cosine similarity is 1. If they are perpendicular (orthogonal), meaning they point in unrelated directions, the angle is 90°, and the cosine similarity is 0. If they point in opposite directions, the angle is 180°, and the cosine similarity is -1 (though in practice, with many common embedding techniques, values are often non-negative, ranging from 0 to 1).
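The three angle cases above can be checked directly. This sketch (assuming NumPy; the toy 2D vectors are made up for illustration) computes the cosine similarity for aligned, perpendicular, and opposite vectors:

```python
import numpy as np

def cos_sim(a, b):
    # Cosine of the angle between a and b.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([1.0, 0.0])

print(cos_sim(v, np.array([2.0, 0.0])))   # same direction (0 degrees)  -> 1.0
print(cos_sim(v, np.array([0.0, 5.0])))   # perpendicular (90 degrees)  -> 0.0
print(cos_sim(v, np.array([-3.0, 0.0])))  # opposite (180 degrees)      -> -1.0
```

Note that the aligned pair scores 1.0 even though the second vector is twice as long, which is exactly the length-insensitivity we want.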
The formula for cosine similarity between two vectors, A and B, is:
$$\text{Cosine Similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$

Here, $A \cdot B$ is the dot product of the vectors, and $\|A\|$ and $\|B\|$ are their magnitudes (or Euclidean norms). By dividing the dot product by the product of the magnitudes, we normalize the measure, making it insensitive to the vectors' lengths and focusing purely on their orientation. This focus on orientation often captures semantic relatedness more effectively than distance metrics.
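The formula translates directly into code. Here is a minimal sketch, assuming NumPy; the two example vectors are made up and stand in for real, much higher-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product of a and b divided by the product of their magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" for illustration.
a = np.array([0.1, 0.3, 0.5])
b = np.array([0.4, 0.2, 0.6])

print(cosine_similarity(a, b))      # ~0.90: similar orientation
print(cosine_similarity(a, 2 * a))  # 1.0: scaling a vector doesn't change the score
```

In practice, libraries often pre-normalize all vectors to unit length so the cosine similarity reduces to a plain dot product, which is cheaper to compute at scale.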
Imagine a simplified 2D space where we have a query vector and several document chunk vectors.
A 2D representation showing a query vector (red) and three document vectors. Vectors 1 and 2 (blue) have smaller angles relative to the query vector, indicating higher cosine similarity and semantic relevance, compared to vector 3 (gray).
In this visualization, the document vectors that form the smallest angles with the query vector (Doc 1 and Doc 2) would have the highest cosine similarity scores and would be considered the most relevant, even if their magnitudes differ. Doc 3, pointing in a very different direction, would have a low similarity score.
So, the retrieval process using similarity search typically involves these steps:
1. Convert the user's query into a vector using the same embedding model applied to the document chunks.
2. Compute the cosine similarity between the query vector and each document chunk vector in the collection.
3. Rank the document chunks by their similarity scores, from highest to lowest.
4. Select the top k document chunks from the ranked list (e.g., the top 3 or top 5 most similar chunks). The value of k is a parameter you can tune.

While comparing the query vector to potentially millions or even billions of document vectors one by one (a "brute-force" or "exact nearest neighbor" search) yields the most accurate results according to the chosen metric, it becomes computationally expensive and slow for large datasets. This performance challenge is precisely why specialized systems, known as vector databases, are often employed. They use sophisticated indexing and search algorithms (such as Approximate Nearest Neighbors, ANN) to find highly similar vectors much more quickly, sometimes sacrificing a tiny amount of accuracy for significant speed gains. We'll look at these databases next.
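Before moving on to vector databases, it helps to see what the brute-force version of this retrieval looks like. The following sketch (assuming NumPy; the function name, toy chunk vectors, and query are hypothetical) normalizes all vectors once so that a single matrix-vector product yields every cosine score, then returns the indices of the top k chunks:

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, k=3):
    """Brute-force (exact) nearest-neighbor search by cosine similarity.

    query_vec:  1-D array, the embedded query.
    chunk_vecs: 2-D array, one embedded document chunk per row.
    Returns the indices of the k most similar chunks, best first.
    """
    # Normalize to unit length so a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                        # one cosine score per chunk
    return np.argsort(scores)[::-1][:k]  # highest scores first

# Toy example: four 2-D "chunk embeddings" and a query.
chunks = np.array([[0.9, 0.1], [-0.5, 0.8], [0.7, 0.3], [0.0, -1.0]])
query = np.array([1.0, 0.2])

print(top_k_chunks(query, chunks, k=2))  # -> [0 2]
```

This exact search touches every chunk vector, which is fine for thousands of chunks but is exactly the cost that ANN indexes in vector databases are designed to avoid.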
© 2025 ApX Machine Learning