Embeddings provide an effective numerical representation of text meaning. To discover related content, you must determine the 'closeness' or 'similarity' between these numerical vectors. This measurement is performed using vector similarity metrics.
Think of each embedding as a point in a high-dimensional space. In this space, texts with similar meanings are located close to each other, while dissimilar texts are far apart. Similarity metrics are mathematical functions that calculate the distance or angular relationship between these points.
Related terms are grouped closely together in the vector space, forming distinct clusters based on their semantic meaning.
Several metrics exist, but for text embeddings, one stands out as the most common and effective: cosine similarity.
Cosine similarity measures the cosine of the angle between two vectors. This metric is particularly well-suited for text embeddings because it focuses on the direction of the vectors rather than their magnitude. The direction captures the semantic content, while the magnitude typically carries little semantic information, especially since most modern embedding models produce normalized vectors (vectors with a length of 1).
The intuition is simple: a score near 1 means the vectors point in nearly the same direction (very similar meaning), a score near 0 means they are close to orthogonal (unrelated), and a score near -1 means they point in opposite directions (opposing meaning).
The formula for cosine similarity between two vectors $A$ and $B$ is:

$$\text{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

When the vectors are normalized to have a length of 1, the denominator becomes 1, and the cosine similarity simplifies to just the dot product of the two vectors, $A \cdot B$.
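To make the formula concrete, here is a minimal sketch that computes cosine similarity by hand with NumPy; the small example vectors are made up for illustration and are not real embeddings.

import numpy as np

# Toy vectors for illustration (not real embeddings)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude
c = np.array([-3.0, 0.0, 1.0])  # orthogonal to a

def cosine(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(a, b))  # 1.0: same direction, so magnitude is ignored
print(cosine(a, c))  # 0.0: orthogonal vectors share no direction

Note how b is twice as long as a, yet the similarity is still 1.0: only the angle matters.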
You can calculate this easily using the cosine_similarity function. Let's compare the similarity between related and unrelated words.
from kerb.embedding import embed, cosine_similarity
# Generate embeddings for two similar words
vec1 = embed("Hello")
vec2 = embed("Hi")
sim_related = cosine_similarity(vec1, vec2)
print(f"Similarity between 'Hello' and 'Hi': {sim_related:.4f}")
# Generate embeddings for two unrelated words
vec3 = embed("Car")
sim_unrelated = cosine_similarity(vec1, vec3)
print(f"Similarity between 'Hello' and 'Car': {sim_unrelated:.4f}")
As expected, the similarity score for "Hello" and "Hi" is high, while the score for "Hello" and "Car" is very low, reflecting their lack of semantic connection.
While cosine similarity is the most common, the toolkit provides other metrics that can be useful in different contexts. Unlike cosine similarity, these are distance metrics, where a lower value indicates greater similarity.
Euclidean distance is the straight-line distance between two points in the vector space. It considers both the direction and magnitude of the vectors. A distance of 0 means the vectors are identical.
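For two $n$-dimensional vectors $A$ and $B$, it is defined as:

$$d(A, B) = \|A - B\| = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$$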
Manhattan distance, also known as "city block" distance, is the sum of the absolute differences of the vectors' components. It's often less sensitive to outliers than Euclidean distance.
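Its formula replaces the squared terms with absolute differences:

$$d(A, B) = \sum_{i=1}^{n} |A_i - B_i|$$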
The dot product measures how much one vector "goes into" another. For normalized vectors, it's identical to cosine similarity. However, for unnormalized vectors, it is sensitive to vector magnitude, meaning longer vectors can have a disproportionately large influence.
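It is computed as:

$$A \cdot B = \sum_{i=1}^{n} A_i B_i$$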
Let's see how these metrics compare for the same set of texts.
from kerb.embedding import embed, euclidean_distance, manhattan_distance, dot_product

vec_a = embed("Python programming language")
vec_b = embed("Python coding and development")
vec_c = embed("JavaScript web framework")

# Euclidean Distance (lower is more similar)
dist_ab = euclidean_distance(vec_a, vec_b)
dist_ac = euclidean_distance(vec_a, vec_c)
print(f"Euclidean distance (similar texts): {dist_ab:.4f}")
print(f"Euclidean distance (different texts): {dist_ac:.4f}")

# Manhattan Distance (lower is more similar)
man_ab = manhattan_distance(vec_a, vec_b)
man_ac = manhattan_distance(vec_a, vec_c)
print(f"\nManhattan distance (similar texts): {man_ab:.4f}")
print(f"Manhattan distance (different texts): {man_ac:.4f}")

# Dot Product (higher is more similar)
dot_ab = dot_product(vec_a, vec_b)
dot_ac = dot_product(vec_a, vec_c)
print(f"\nDot product (similar texts): {dot_ab:.4f}")
print(f"Dot product (different texts): {dot_ac:.4f}")
You can see that both distance metrics are smaller for the similar pair of texts, while the dot product is larger; all three correctly identify the closer relationship.
For most text-based semantic search applications, cosine similarity is the recommended metric. Because modern embedding models produce normalized vectors, their length doesn't carry significant information, and focusing on the angle (direction) gives the most reliable measure of semantic relatedness.
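If your embedding model does normalize its outputs, cosine similarity and the dot product coincide. Here is a minimal sketch of that check, assuming embed returns a plain sequence of floats:

import math

from kerb.embedding import embed, cosine_similarity, dot_product

vec = embed("semantic search")
other = embed("vector retrieval")

# A normalized vector has a length of ~1.0
length = math.sqrt(sum(x * x for x in vec))
print(f"Vector length: {length:.4f}")

# For unit-length vectors, the dot product equals the cosine similarity
print(f"Cosine similarity: {cosine_similarity(vec, other):.6f}")
print(f"Dot product:       {dot_product(vec, other):.6f}")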
The batch_similarity function provides an efficient way to compare a single query vector against a collection of document vectors using any of the available metrics. This is the foundation of a semantic search system.
from kerb.embedding import embed, embed_batch, batch_similarity

query_text = "cloud computing infrastructure"
documents = [
    "Cloud services and platforms",
    "Infrastructure as a service",
    "Traditional on-premise servers",
    "Mobile app development",
]

query_emb = embed(query_text)
doc_embeddings = embed_batch(documents)

# Compare scores from different metrics
cosine_scores = batch_similarity(query_emb, doc_embeddings, metric="cosine")
print("Cosine Similarity (higher is better):")
for doc, score in zip(documents, cosine_scores):
    print(f"  [{score:.4f}] {doc}")

euclidean_scores = batch_similarity(query_emb, doc_embeddings, metric="euclidean")
print("\nEuclidean Distance (lower is better):")
for doc, score in zip(documents, euclidean_scores):
    print(f"  [{score:.4f}] {doc}")
Notice how both metrics correctly identify "Cloud services and platforms" and "Infrastructure as a service" as the most relevant documents. However, their scoring ranges and interpretations differ. Cosine similarity provides a normalized, intuitive score between -1 and 1, making it easier to set consistent relevance thresholds.
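As a next step toward search, you can sort the results and apply a cutoff. The sketch below reuses the query and documents from above; the 0.5 threshold is an illustrative assumption, not a universal value, so tune it for your model and data.

from kerb.embedding import embed, embed_batch, batch_similarity

documents = [
    "Cloud services and platforms",
    "Infrastructure as a service",
    "Traditional on-premise servers",
    "Mobile app development",
]
query_emb = embed("cloud computing infrastructure")
cosine_scores = batch_similarity(query_emb, embed_batch(documents), metric="cosine")

THRESHOLD = 0.5  # illustrative cutoff; tune for your model and data

# Rank documents from most to least similar, then flag them against the cutoff
ranked = sorted(zip(documents, cosine_scores), key=lambda pair: pair[1], reverse=True)
for doc, score in ranked:
    status = "keep" if score >= THRESHOLD else "drop"
    print(f"[{status}] [{score:.4f}] {doc}")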
With a solid grasp of embeddings and how to compare them, you are now ready to build your first semantic search function.