Having explored the concepts of representing data as vectors and measuring their similarity, let's put this into practice. This hands-on section will guide you through generating vector embeddings for text data using a popular Python library and then comparing these embeddings using the similarity metrics we discussed.
First, ensure you have the necessary libraries installed. We'll primarily use sentence-transformers for generating embeddings, and numpy along with scipy for numerical operations and similarity calculations. If you plan to visualize the embeddings (which we'll demonstrate), you'll also need scikit-learn for dimensionality reduction and plotly for plotting.
You can install these using pip:
pip install sentence-transformers numpy scipy scikit-learn plotly
The sentence-transformers library provides easy access to numerous pre-trained models suitable for generating high-quality sentence and text embeddings. These models, often based on transformer architectures like BERT, have been fine-tuned to map semantically similar sentences to nearby vectors in the embedding space.
For this exercise, we'll use all-MiniLM-L6-v2. It offers a good balance between performance and computational efficiency, making it suitable for experimentation without requiring excessive resources.
Let's load the model in Python:
from sentence_transformers import SentenceTransformer
# Load a pre-trained sentence embedding model
model_name = 'all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)
print(f"Model '{model_name}' loaded successfully.")
# Output: Model 'all-MiniLM-L6-v2' loaded successfully.
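Before encoding anything, you can confirm the width of the vectors this model produces. This is just a quick sanity check; the value shown assumes the same all-MiniLM-L6-v2 model loaded above.
# Check the output vector width of the loaded model
embedding_dim = model.get_sentence_embedding_dimension()
print(f"Embedding dimension: {embedding_dim}")
# Expected output for all-MiniLM-L6-v2: Embedding dimension: 384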
Now, let's define a few sample sentences. Notice how some sentences are semantically similar, while others discuss different topics.
import numpy as np
sentences = [
    "The delivery arrived on time.",
    "My package was delivered promptly.",
    "Vector databases store numerical representations.",
    "How can I optimize database queries?",
    "The weather today is sunny and warm."
]
# Generate embeddings for the sentences
embeddings = model.encode(sentences)
# Check the dimensions of the embeddings
print(f"Shape of embeddings matrix: {embeddings.shape}")
# Output: Shape of embeddings matrix: (5, 384)
# Display the first few dimensions of the first embedding
print(f"First 5 dimensions of the first sentence embedding:\n{embeddings[0, :5]}")
# Output: First 5 dimensions of the first sentence embedding:
# [-0.0568201 -0.01949895 0.00780028 0.03499191 -0.0031617 ]
As you can see, the model.encode() method takes our list of sentences and returns a NumPy array. The shape (5, 384) indicates we have 5 embeddings (one for each sentence), and each embedding is a vector of 384 dimensions, as determined by the all-MiniLM-L6-v2 model architecture.
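You can also pass a single string to model.encode(), which returns one 1-D vector of the same width. This is handy later when you want to compare a search query against stored embeddings; the query text below is just an illustrative example, not part of our sentence list.
# Encoding a single (hypothetical) query returns one 384-dimensional vector
query_embedding = model.encode("Where is my parcel?")
print(f"Shape of a single query embedding: {query_embedding.shape}")
# Expected output: Shape of a single query embedding: (384,)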
With the embeddings generated, we can now quantify the semantic similarity between pairs of sentences using the metrics discussed earlier, such as Cosine Similarity. Remember, Cosine Similarity measures the cosine of the angle between two vectors. A value closer to 1 indicates high similarity, 0 indicates orthogonality (no similarity), and -1 indicates opposite meaning (though less common with these types of embeddings).
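To make the formula concrete, here is a minimal sketch that computes the cosine similarity between the first two sentence embeddings directly with NumPy: the dot product of the vectors divided by the product of their norms. It should match the scipy-based matrix we compute below.
# Cosine similarity by hand: dot product divided by the product of the vector norms
a, b = embeddings[0], embeddings[1]
manual_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity between sentences 1 and 2: {manual_similarity:.2f}")
# Expected output (approximately): 0.88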
We can use scipy.spatial.distance.cosine to calculate the cosine distance (which is 1 - similarity) and then convert it back to similarity.
from scipy.spatial.distance import cosine
# Calculate pairwise cosine similarity
num_sentences = len(sentences)
similarity_matrix = np.zeros((num_sentences, num_sentences))
for i in range(num_sentences):
    for j in range(num_sentences):
        # Cosine distance = 1 - cosine similarity
        similarity = 1 - cosine(embeddings[i], embeddings[j])
        similarity_matrix[i, j] = similarity
# Print the similarity matrix (rounded for readability)
print("Pairwise Cosine Similarity Matrix:")
print(np.round(similarity_matrix, 2))
Running this code will produce a matrix like this:
Pairwise Cosine Similarity Matrix:
[[ 1. 0.88 -0.04 -0.02 -0.01]
[ 0.88 1. -0.05 -0.02 0.01]
[-0.04 -0.05 1. 0.46 -0.03]
[-0.02 -0.02 0.46 1. -0.01]
[-0.01 0.01 -0.03 -0.01 1. ]]
Let's analyze the similarity matrix. The two delivery sentences (rows 1 and 2) score about 0.88 with each other, reflecting their nearly identical meaning despite different wording. The two database sentences (rows 3 and 4) show a moderate similarity of about 0.46: they share a topic but make different points. All remaining pairs, including any pairing with the weather sentence, sit near 0, indicating little semantic overlap, and the diagonal is exactly 1 because every sentence is identical to itself.
This demonstrates the power of embeddings: the numerical vectors capture semantic meaning, and their proximity in the vector space (measured by Cosine Similarity) reflects this meaning.
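The nested loop above is perfectly adequate for five sentences, but it scales poorly. As a sketch of a vectorized alternative, you can normalize the embeddings to unit length and obtain the whole similarity matrix with a single matrix multiplication; the sentence-transformers library also ships a util.cos_sim helper that does essentially the same thing.
# Vectorized alternative: normalize rows to unit length, then one matrix multiplication
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
unit_embeddings = embeddings / norms
vectorized_matrix = unit_embeddings @ unit_embeddings.T
# This should closely match the loop-based matrix computed earlier
print(np.round(vectorized_matrix, 2))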
Visualizing 384-dimensional vectors directly is impossible. However, we can use dimensionality reduction techniques like Principal Component Analysis (PCA) or Uniform Manifold Approximation and Projection (UMAP) to project these vectors into 2D or 3D space for visualization. This helps build an intuition for how embeddings cluster based on meaning, although some information is inevitably lost during reduction.
Here's how you might use PCA to reduce to 2 dimensions and plot using Plotly:
from sklearn.decomposition import PCA
import plotly.graph_objects as go
# Reduce dimensions to 2D using PCA
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)
# Create a scatter plot (using pre-calculated example coordinates for consistency)
# Replace these coordinates with the actual output of your PCA transformation
# if you run the code yourself.
example_coords = {
    "x": [-0.15, -0.18, 0.10, 0.25, -0.02],
    "y": [0.08, 0.10, -0.05, -0.12, 0.01]
}
fig = go.Figure(data=go.Scatter(
    # x=embeddings_2d[:, 0],  # Use actual PCA output
    # y=embeddings_2d[:, 1],  # Use actual PCA output
    x=example_coords["x"],  # Using example values
    y=example_coords["y"],  # Using example values
    mode='markers+text',
    marker=dict(
        size=10,
        color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd'],  # Example colors
        opacity=0.8
    ),
    text=[f"'{s[:20]}...'" for s in sentences],  # Display shortened text labels
    textposition='top center'
))
fig.update_layout(
    title='2D PCA Projection of Sentence Embeddings',
    xaxis_title='Principal Component 1',
    yaxis_title='Principal Component 2',
    width=700,
    height=500,
    showlegend=False
)
# To display the plot (e.g., in a Jupyter Notebook or save to HTML)
# fig.show()
# Or generate the JSON for web embedding
chart_json = fig.to_json()
print("\nPlotly JSON for embedding visualization:")
# fig.to_json() emits compact, single-line JSON by default (suitable for a ```plotly block);
# avoid stripping spaces, which would corrupt the text labels inside the JSON strings.
print(chart_json)
2D visualization of sentence embeddings after PCA reduction. Points closer together represent sentences with higher semantic similarity in the original high-dimensional space. Note how the two delivery sentences are near each other, as are the two database sentences, while the weather sentence is relatively isolated.
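Because PCA discards variance, it can be useful to check how much of the original spread the two components retain. This is a quick diagnostic on the pca object fitted above; the exact values depend on your embeddings.
# Fraction of the embeddings' variance captured by each of the two components
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance retained in 2D: {pca.explained_variance_ratio_.sum():.2%}")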
This practical exercise demonstrated how to convert text into meaningful numerical representations (embeddings) and how to use vector similarity calculations to compare them. These generated vectors and the ability to quickly find similar ones form the foundation for the vector databases and semantic search systems we will build upon in the following chapters.