Having explored the concepts of representing data as vectors and measuring their similarity, let's put this into practice. This hands-on section will guide you through generating vector embeddings for text data using a popular Python library and then comparing these embeddings using the similarity metrics we discussed.

## Setting Up Your Environment

First, ensure you have the necessary libraries installed. We'll primarily use `sentence-transformers` for generating embeddings and `numpy` along with `scipy` for numerical operations and similarity calculations. If you plan to visualize the embeddings (which we'll demonstrate), you'll also need `scikit-learn` for dimensionality reduction and `plotly` for plotting.

You can install these using pip:

```bash
pip install sentence-transformers numpy scipy scikit-learn plotly
```

## Choosing and Loading an Embedding Model

The `sentence-transformers` library provides easy access to numerous pre-trained models suitable for generating high-quality sentence and text embeddings. These models, often based on transformer architectures like BERT, have been fine-tuned to map semantically similar sentences to nearby vectors in the embedding space.

For this exercise, we'll use `all-MiniLM-L6-v2`. It offers a good balance between performance and computational efficiency, making it suitable for experimentation without requiring excessive resources.

Let's load the model in Python:

```python
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence embedding model
model_name = 'all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)

print(f"Model '{model_name}' loaded successfully.")
# Output: Model 'all-MiniLM-L6-v2' loaded successfully.
```

## Generating Embeddings

Now, let's define a few sample sentences. Notice how some sentences are semantically similar, while others discuss different topics.

```python
import numpy as np

sentences = [
    "The delivery arrived on time.",
    "My package was delivered promptly.",
    "Vector databases store numerical representations.",
    "How can I optimize database queries?",
    "The weather today is sunny and warm."
]

# Generate embeddings for the sentences
embeddings = model.encode(sentences)

# Check the dimensions of the embeddings
print(f"Shape of embeddings matrix: {embeddings.shape}")
# Output: Shape of embeddings matrix: (5, 384)

# Display the first few dimensions of the first embedding
print(f"First 5 dimensions of the first sentence embedding:\n{embeddings[0, :5]}")
# Output: First 5 dimensions of the first sentence embedding:
# [-0.0568201 -0.01949895 0.00780028 0.03499191 -0.0031617 ]
```

As you can see, the `model.encode()` method takes our list of sentences and returns a NumPy array. The shape `(5, 384)` indicates we have 5 embeddings (one for each sentence), and each embedding is a vector of 384 dimensions, as determined by the `all-MiniLM-L6-v2` model architecture.

## Calculating Semantic Similarity

With the embeddings generated, we can now quantify the semantic similarity between pairs of sentences using the metrics discussed earlier, such as Cosine Similarity. Remember, Cosine Similarity measures the cosine of the angle between two vectors. A value closer to 1 indicates high similarity, 0 indicates orthogonality (no similarity), and -1 indicates opposite meaning (though less common with these types of embeddings).
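Before turning to SciPy, it can help to see the definition in action: cosine similarity is $\frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}$. The minimal sketch below (assuming the `embeddings` array from the previous step is still in memory) applies this formula to the two delivery-related sentences; the result should match the corresponding entry of the similarity matrix we compute next.

```python
import numpy as np

# Cosine similarity straight from the definition: dot(a, b) / (||a|| * ||b||)
# Uses the `embeddings` array generated in the previous step.
a, b = embeddings[0], embeddings[1]  # the two delivery-related sentences
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity between sentences 0 and 1: {cos_sim:.2f}")
```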
We can use `scipy.spatial.distance.cosine` to calculate the cosine distance (which is $1 - \text{similarity}$). We'll then convert it back to similarity.

```python
from scipy.spatial.distance import cosine

# Calculate pairwise cosine similarity
num_sentences = len(sentences)
similarity_matrix = np.zeros((num_sentences, num_sentences))

for i in range(num_sentences):
    for j in range(num_sentences):
        # Cosine distance = 1 - cosine similarity
        similarity = 1 - cosine(embeddings[i], embeddings[j])
        similarity_matrix[i, j] = similarity

# Print the similarity matrix (rounded for readability)
print("Pairwise Cosine Similarity Matrix:")
print(np.round(similarity_matrix, 2))
```

Running this code will produce a matrix like this:

```
Pairwise Cosine Similarity Matrix:
[[ 1.    0.88 -0.04 -0.02 -0.01]
 [ 0.88  1.   -0.05 -0.02  0.01]
 [-0.04 -0.05  1.    0.46 -0.03]
 [-0.02 -0.02  0.46  1.   -0.01]
 [-0.01  0.01 -0.03 -0.01  1.  ]]
```

## Interpreting the Results

Let's analyze the similarity matrix:

- **Diagonal:** The diagonal elements are all 1.0, as each sentence is perfectly similar to itself.
- **High Similarity:** Notice the high similarity (0.88) between sentences 0 ("The delivery arrived on time.") and 1 ("My package was delivered promptly."). This aligns with our intuition, as they convey very similar meanings.
- **Moderate Similarity:** Sentences 2 ("Vector databases store numerical representations.") and 3 ("How can I optimize database queries?") show a moderate similarity (0.46). Both relate to databases, but discuss different aspects (storage vs. optimization).
- **Low Similarity:** The sentence about weather (sentence 4) has very low similarity scores (close to 0) with all other sentences, correctly reflecting its distinct topic. Similarly, the delivery sentences show low similarity to the database sentences.

This demonstrates the power of embeddings: the numerical vectors capture semantic meaning, and their proximity in the vector space (measured by Cosine Similarity) reflects this meaning.
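As a brief aside, the double loop above is easy to follow but does more work than necessary: because cosine similarity is symmetric, and because it reduces to a dot product once the vectors have unit length, the whole matrix can be obtained with a single matrix product. Here is a minimal NumPy sketch, reusing `embeddings` from earlier:

```python
import numpy as np

# Vectorized alternative: scale each embedding to unit length, then one
# matrix product yields every pairwise cosine similarity at once.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
unit_embeddings = embeddings / norms
similarity_matrix_fast = unit_embeddings @ unit_embeddings.T

print(np.round(similarity_matrix_fast, 2))  # should match the loop-based matrix above
```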
## Visualizing Embeddings (Optional)

Visualizing 384-dimensional vectors directly is impossible. However, we can use dimensionality reduction techniques like Principal Component Analysis (PCA) or Uniform Manifold Approximation and Projection (UMAP) to project these vectors into 2D or 3D space for visualization. This helps build an intuition for how embeddings cluster based on meaning, although some information is inevitably lost during reduction.

Here's how you might use PCA to reduce to 2 dimensions and plot the result with Plotly:

```python
from sklearn.decomposition import PCA
import plotly.graph_objects as go

# Reduce dimensions to 2D using PCA
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

# Create a scatter plot (using pre-calculated example coordinates for consistency).
# Replace these coordinates with the actual output of your PCA transformation
# if you run the code yourself.
example_coords = {
    "x": [-0.15, -0.18, 0.10, 0.25, -0.02],
    "y": [0.08, 0.10, -0.05, -0.12, 0.01]
}

fig = go.Figure(data=go.Scatter(
    # x=embeddings_2d[:, 0],  # Use actual PCA output
    # y=embeddings_2d[:, 1],  # Use actual PCA output
    x=example_coords["x"],  # Using example values
    y=example_coords["y"],  # Using example values
    mode='markers+text',
    marker=dict(
        size=10,
        color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd'],  # Example colors
        opacity=0.8
    ),
    text=[f"'{s[:20]}...'" for s in sentences],  # Display shortened text labels
    textposition='top center'
))

fig.update_layout(
    title='2D PCA Projection of Sentence Embeddings',
    xaxis_title='Principal Component 1',
    yaxis_title='Principal Component 2',
    width=700,
    height=500,
    showlegend=False
)

# To display the plot (e.g., in a Jupyter Notebook or save to HTML)
# fig.show()

# Or generate the JSON for web embedding. fig.to_json() already returns
# a compact, single-line JSON string suitable for the plotly block below.
chart_json = fig.to_json()
print("\nPlotly JSON for embedding visualization:")
print(chart_json)
```

```plotly
{"data":[{"marker":{"color":["#1f77b4","#ff7f0e","#2ca02c","#d62728","#9467bd"],"opacity":0.8,"size":10},"mode":"markers+text","text":["'The delivery arrived...'","'My package was deliv...'","'Vector databases sto...'","'How can I optimize d...'","'The weather today is...'"],"textposition":"top center","type":"scatter","x":[-0.15,-0.18,0.1,0.25,-0.02],"y":[0.08,0.1,-0.05,-0.12,0.01]}],"layout":{"height":500,"showlegend":false,"title":{"text":"2D PCA Projection of Sentence Embeddings"},"width":700,"xaxis":{"title":{"text":"Principal Component 1"}},"yaxis":{"title":{"text":"Principal Component 2"}}}}
```

2D visualization of sentence embeddings after PCA reduction. Points closer together represent sentences with higher semantic similarity in the original high-dimensional space. Note how the two delivery sentences are near each other, as are the two database sentences, while the weather sentence is relatively isolated.

This practical exercise demonstrated how to convert text into meaningful numerical representations (embeddings) and how to use vector similarity calculations to compare them. These generated vectors and the ability to quickly find similar ones form the foundation for the vector databases and semantic search systems we will build upon in the following chapters.
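As a small preview of that idea, the sketch below performs a brute-force semantic search over our five sentences: it embeds a query (the query string here is just an invented example), scores it against every stored embedding with cosine similarity, and returns the closest sentence. A vector database automates exactly this kind of lookup, only with an index instead of a full scan.

```python
import numpy as np

# Minimal brute-force semantic search over the example sentences.
# Assumes `model`, `sentences`, and `embeddings` from the steps above.
query = "When will my order arrive?"  # invented example query
query_embedding = model.encode([query])[0]

# Cosine similarity of the query against every stored embedding
sentence_norms = np.linalg.norm(embeddings, axis=1)
scores = embeddings @ query_embedding / (sentence_norms * np.linalg.norm(query_embedding))

best = int(np.argmax(scores))
print(f"Best match: '{sentences[best]}' (cosine similarity: {scores[best]:.2f})")
```

For this query you would expect one of the delivery-related sentences to come out on top.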