Now that we understand how data like text or images can be transformed into numerical vectors residing in a high-dimensional space, the next logical step is to figure out how to quantify the relationship between these vectors. If our embedding model has successfully captured semantic meaning, vectors representing similar concepts should be "close" to each other in this space, while dissimilar concepts should be "far apart". Measuring this proximity is fundamental for tasks like finding related documents, recommending similar items, or performing semantic search.
In vector spaces, "closeness" is typically measured using distance or similarity metrics. These mathematical functions take two vectors as input and output a scalar value indicating their similarity or dissimilarity. Let's examine the most common metrics used in the context of vector embeddings.
Cosine Similarity is arguably the most popular metric for comparing high-dimensional embeddings, especially in natural language processing. Instead of measuring the absolute distance between the endpoints of two vectors, it measures the cosine of the angle between them.
The formula for Cosine Similarity between two vectors a and b is:
$$\text{Cosine Similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}|| \, ||\mathbf{b}||} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}$$

Where:
- $\mathbf{a} \cdot \mathbf{b}$ is the dot product of the two vectors,
- $||\mathbf{a}||$ and $||\mathbf{b}||$ are their magnitudes (Euclidean norms),
- $a_i$ and $b_i$ are the individual vector components, and $n$ is the number of dimensions.
Intuition: Cosine Similarity focuses purely on the orientation or direction of the vectors, ignoring their magnitudes. Imagine two arrows originating from the same point. If they point in the exact same direction, the angle between them is 0°, and the cosine is 1 (maximum similarity). If they point in opposite directions, the angle is 180°, and the cosine is -1 (maximum dissimilarity). If they are orthogonal (perpendicular), the angle is 90°, and the cosine is 0 (no correlation or similarity in direction).
The result always falls within the range [-1, 1]. For many embedding models, vectors are normalized to have a unit length ($||\mathbf{v}|| = 1$). In such cases, the denominator becomes 1, and Cosine Similarity simplifies to just the dot product ($\mathbf{a} \cdot \mathbf{b}$). When dealing with embeddings where only relative direction matters (common for text semantics), Cosine Similarity is often the preferred choice. A value closer to 1 indicates higher semantic similarity.
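To make this concrete, here is a minimal NumPy sketch of the formula above. The vectors and the helper function name are made up purely for illustration.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction as a, twice the magnitude
c = np.array([-1.0, -2.0, -3.0]) # opposite direction to a

print(cosine_similarity(a, b))   # 1.0  -> identical direction
print(cosine_similarity(a, c))   # -1.0 -> opposite direction

# With unit-length vectors, the denominator is 1, so the dot product alone suffices.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
print(np.dot(a_unit, b_unit))    # also 1.0
```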
Euclidean Distance is the standard "ruler" distance we learn about in geometry. It measures the straight-line distance between the endpoints of two vectors in the vector space.
The formula for Euclidean Distance between two vectors a and b is:
$$\text{Euclidean Distance}(\mathbf{a}, \mathbf{b}) = ||\mathbf{a} - \mathbf{b}||_2 = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$

Where $a_i$ and $b_i$ are the individual vector components and $n$ is the number of dimensions, as before.
Intuition: Euclidean Distance considers both the direction and the magnitude of the vectors. Two vectors are considered close if their endpoints are near each other in the space. The result is always non-negative (≥0), where 0 indicates the vectors are identical. Larger values signify greater dissimilarity.
Unlike Cosine Similarity, Euclidean Distance is sensitive to the length (magnitude) of the vectors. If one vector is much longer than another but points in a similar direction, the Euclidean distance might be large, while the Cosine Similarity would still be high. For embeddings where magnitude carries meaning (though less common in pure semantic search), or for specific types of clustering, Euclidean Distance can be useful. Keep in mind that in high-dimensional spaces, the concept of distance can become less intuitive (curse of dimensionality), which is one reason Cosine Similarity is often favored.
Vectors A and B in 2D space. Cosine Similarity depends on the angle θ, while Euclidean Distance is the length of the dashed red line connecting the vector endpoints.
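The contrast between the two metrics is easiest to see with a vector and a scaled copy of it. The 2D vectors in this sketch are made up for illustration; they point in the same direction but differ greatly in magnitude.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 1.0])
b = np.array([10.0, 10.0])  # same direction as a, ten times the magnitude

# Euclidean distance reacts strongly to the difference in magnitude...
print(np.linalg.norm(a - b))    # ~12.73
# ...while cosine similarity looks only at direction.
print(cosine_similarity(a, b))  # 1.0
```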
The Dot Product (or Inner Product) is another way to compare vectors. It's closely related to Cosine Similarity.
The formula for the Dot Product between two vectors a and b is:
$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i$$

It's also defined geometrically as:

$$\mathbf{a} \cdot \mathbf{b} = ||\mathbf{a}|| \, ||\mathbf{b}|| \cos(\theta)$$

Where $\theta$ is the angle between the vectors.
Intuition: The Dot Product considers both the angle (like Cosine Similarity) and the magnitudes of the vectors. A larger dot product can result from vectors having large magnitudes, being closely aligned directionally, or both. The range of the dot product is (−∞,∞).
If the vectors are normalized to unit length ($||\mathbf{a}|| = ||\mathbf{b}|| = 1$), then the Dot Product becomes identical to Cosine Similarity ($\mathbf{a} \cdot \mathbf{b} = \cos(\theta)$). Many vector databases and libraries optimize calculations for normalized vectors, making the Dot Product a computationally efficient way to calculate Cosine Similarity in practice. However, if vectors are not normalized, the score is harder to interpret as a similarity, because magnitude heavily influences it.
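A short sketch of this equivalence, using randomly generated vectors as stand-ins for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384)  # 384 chosen only as a typical embedding dimensionality
b = rng.normal(size=384)

# Cosine similarity computed directly from the definition.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize to unit length first; a plain dot product then gives the same value.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
print(np.isclose(cosine, np.dot(a_unit, b_unit)))  # True
```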
In most vector database applications focused on semantic search, you'll primarily encounter and utilize Cosine Similarity, often implemented via optimized Dot Product calculations on normalized vectors. Understanding these metrics is essential for interpreting search results and configuring your vector database index correctly, as we'll see in later chapters.
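To tie these ideas together, the sketch below performs a brute-force similarity search over a small set of random placeholder embeddings (no real model or database involved). Because every vector is normalized to unit length, a single matrix-vector product yields the cosine similarity of the query against the whole collection; production vector databases add approximate indexes on top of this, but the scoring idea is the same.

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder "corpus" of 1,000 embeddings, normalized to unit length.
corpus = rng.normal(size=(1000, 384))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Placeholder query embedding, also normalized.
query = rng.normal(size=384)
query /= np.linalg.norm(query)

# One matrix-vector product = cosine similarity against every corpus vector.
scores = corpus @ query

# Indices of the 5 most similar items, highest score first.
top_k = np.argsort(-scores)[:5]
print(top_k, scores[top_k])
```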