Embeddings provide an effective numerical representation of text meaning. To discover related content, you must determine the 'closeness' or 'similarity' between these numerical vectors. This measurement is performed using vector similarity metrics.
Think of each embedding as a point in a high-dimensional space. In this space, texts with similar meanings are located close to each other, while dissimilar texts are far apart. Similarity metrics are mathematical functions that calculate the distance or angular relationship between these points.
Related terms are grouped closely together in the vector space, forming distinct clusters based on their semantic meaning.
Several metrics exist, but for text embeddings, one stands out as the most common and effective: cosine similarity.
Cosine similarity measures the cosine of the angle between two vectors. This metric is particularly well-suited for text embeddings because it focuses on the direction of the vectors rather than their magnitude. The direction captures the semantic content, while the magnitude typically carries little semantic information, especially since most modern embedding models produce normalized vectors (vectors with a length of 1).
The intuition is simple: a score near 1 means the vectors point in nearly the same direction (very similar meaning), a score near 0 means they are close to orthogonal (unrelated), and a score near -1 means they point in opposite directions (opposing meaning).
The formula for cosine similarity between two vectors $A$ and $B$ is:

$$\text{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

When the vectors are normalized to have a length of 1, the denominator becomes 1, and the cosine similarity simplifies to just the dot product of the two vectors, $A \cdot B$.
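To make the formula concrete, here is a minimal sketch that computes cosine similarity by hand with NumPy; the small example vectors are made up for illustration and are not real embeddings.

import numpy as np

# Toy vectors for illustration (not real embeddings)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude
c = np.array([-3.0, 0.0, 1.0])  # orthogonal to a

def cosine(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(a, b))  # 1.0: same direction, so magnitude is ignored
print(cosine(a, c))  # 0.0: orthogonal vectors share no direction

Note how b is twice as long as a, yet the similarity is still 1.0: only the angle matters.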
You can calculate this easily using the cosine_similarity function. Let's compare the similarity between related and unrelated words.
from kerb.embedding import embed, cosine_similarity
# Generate embeddings for two similar words
vec1 = embed("Hello")
vec2 = embed("Hi")
sim_related = cosine_similarity(vec1, vec2)
print(f"Similarity between 'Hello' and 'Hi': {sim_related:.4f}")
# Generate embeddings for two unrelated words
vec3 = embed("Car")
sim_unrelated = cosine_similarity(vec1, vec3)
print(f"Similarity between 'Hello' and 'Car': {sim_unrelated:.4f}")
As expected, the similarity score for "Hello" and "Hi" is high, while the score for "Hello" and "Car" is very low, reflecting their lack of semantic connection.
While cosine similarity is the most common, the toolkit provides other metrics that can be useful in different contexts. Unlike cosine similarity, these are distance metrics, where a lower value indicates greater similarity.
Euclidean distance is the straight-line distance between two points in the vector space. It considers both the direction and magnitude of the vectors. A distance of 0 means the vectors are identical.
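For two $n$-dimensional vectors $A$ and $B$, it is defined as:

$$d(A, B) = \|A - B\| = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$$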
Manhattan distance, also known as "city block" distance, is the sum of the absolute differences of the vectors' components. It's often less sensitive to outliers than Euclidean distance.
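Its formula replaces the squared terms with absolute differences:

$$d(A, B) = \sum_{i=1}^{n} |A_i - B_i|$$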
The dot product measures how much one vector "goes into" another. For normalized vectors, it's identical to cosine similarity. However, for unnormalized vectors, it is sensitive to vector magnitude, meaning longer vectors can have a disproportionately large influence.
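It is computed as:

$$A \cdot B = \sum_{i=1}^{n} A_i B_i$$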
Let's see how these metrics compare for the same set of texts.
from kerb.embedding import embed, euclidean_distance, manhattan_distance, dot_product

vec_a = embed("Python programming language")
vec_b = embed("Python coding and development")
vec_c = embed("JavaScript web framework")

# Euclidean Distance (lower is more similar)
dist_ab = euclidean_distance(vec_a, vec_b)
dist_ac = euclidean_distance(vec_a, vec_c)
print(f"Euclidean distance (similar texts): {dist_ab:.4f}")
print(f"Euclidean distance (different texts): {dist_ac:.4f}")

# Manhattan Distance (lower is more similar)
man_ab = manhattan_distance(vec_a, vec_b)
man_ac = manhattan_distance(vec_a, vec_c)
print(f"\nManhattan distance (similar texts): {man_ab:.4f}")
print(f"Manhattan distance (different texts): {man_ac:.4f}")

# Dot Product (higher is more similar)
dot_ab = dot_product(vec_a, vec_b)
dot_ac = dot_product(vec_a, vec_c)
print(f"\nDot product (similar texts): {dot_ab:.4f}")
print(f"Dot product (different texts): {dot_ac:.4f}")
You can see that both distance metrics are smaller for the similar pair of texts, while the dot product is larger; all three correctly identify the closer relationship.
For most text-based semantic search applications, cosine similarity is the recommended metric. Because modern embedding models produce normalized vectors, their length doesn't carry significant information, and focusing on the angle (direction) gives the most reliable measure of semantic relatedness.
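If your embedding model does normalize its outputs, cosine similarity and the dot product coincide. Here is a minimal sketch of that check, assuming embed returns a plain sequence of floats:

import math

from kerb.embedding import embed, cosine_similarity, dot_product

vec = embed("semantic search")
other = embed("vector retrieval")

# A normalized vector has a length of ~1.0
length = math.sqrt(sum(x * x for x in vec))
print(f"Vector length: {length:.4f}")

# For unit-length vectors, the dot product equals the cosine similarity
print(f"Cosine similarity: {cosine_similarity(vec, other):.6f}")
print(f"Dot product:       {dot_product(vec, other):.6f}")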
The batch_similarity function provides an efficient way to compare a single query vector against a collection of document vectors using any of the available metrics. This is the foundation of a semantic search system.
from kerb.embedding import embed, embed_batch, batch_similarity

query_text = "cloud computing infrastructure"
documents = [
    "Cloud services and platforms",
    "Infrastructure as a service",
    "Traditional on-premise servers",
    "Mobile app development",
]

query_emb = embed(query_text)
doc_embeddings = embed_batch(documents)

# Compare scores from different metrics
cosine_scores = batch_similarity(query_emb, doc_embeddings, metric="cosine")
print("Cosine Similarity (higher is better):")
for doc, score in zip(documents, cosine_scores):
    print(f"  [{score:.4f}] {doc}")

euclidean_scores = batch_similarity(query_emb, doc_embeddings, metric="euclidean")
print("\nEuclidean Distance (lower is better):")
for doc, score in zip(documents, euclidean_scores):
    print(f"  [{score:.4f}] {doc}")
Notice how both metrics correctly identify "Cloud services and platforms" and "Infrastructure as a service" as the most relevant documents. However, their scoring ranges and interpretations differ. Cosine similarity provides a normalized, intuitive score between -1 and 1, making it easier to set consistent relevance thresholds.
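As a next step toward search, you can sort the results and apply a cutoff. The sketch below reuses the query and documents from above; the 0.5 threshold is an illustrative assumption, not a universal value, so tune it for your model and data.

from kerb.embedding import embed, embed_batch, batch_similarity

documents = [
    "Cloud services and platforms",
    "Infrastructure as a service",
    "Traditional on-premise servers",
    "Mobile app development",
]
query_emb = embed("cloud computing infrastructure")
cosine_scores = batch_similarity(query_emb, embed_batch(documents), metric="cosine")

THRESHOLD = 0.5  # illustrative cutoff; tune for your model and data

# Rank documents from most to least similar, then flag them against the cutoff
ranked = sorted(zip(documents, cosine_scores), key=lambda pair: pair[1], reverse=True)
for doc, score in ranked:
    status = "keep" if score >= THRESHOLD else "drop"
    print(f"[{status}] [{score:.4f}] {doc}")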
With a solid grasp of embeddings and how to compare them, you are now ready to build your first semantic search function.