Evaluating the performance of your vector search setup is essential for optimizing the trade-off between relevance (recall/precision) and speed (latency/throughput). This hands-on exercise guides you through setting up and executing a comprehensive performance evaluation for an Approximate Nearest Neighbor (ANN) index. We will simulate evaluating an HNSW index, a common choice covered in Chapter 1, focusing on Recall@K and query latency while varying a significant search-time tuning parameter, `efSearch`.

## Prerequisites

Before starting, ensure you have a working Python environment with `numpy` for numerical operations, `pandas` for data manipulation, and a library capable of running ANN search (such as `faiss-cpu` or `faiss-gpu`, `annoy`, or a client for a vector database). You will also need:

- **Embeddings dataset:** A collection of vector embeddings. For this exercise, assume you have a NumPy array `data_vectors` of shape `(N, D)`, where `N` is the number of vectors and `D` is the dimensionality.
- **Query set:** A smaller set of vectors `query_vectors` of shape `(Q, D)` for which you want to find neighbors.
- **Ground truth:** The true nearest neighbors for each query vector in `query_vectors`, typically pre-computed with an exact k-NN search (i.e., a brute-force calculation) over `data_vectors`. Assume you have this stored as a 2D array `ground_truth_indices`, where `ground_truth_indices[i]` contains the indices of the actual top-K nearest neighbors of `query_vectors[i]`. Let `K` be the number of neighbors we care about (e.g., `K = 10`). One way to compute this is sketched right after this list.
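If you do not already have ground truth, you can compute it once with an exact brute-force search over the same vectors. Here is a minimal sketch using Faiss's exact `IndexFlatL2`; it assumes L2 distance and that `data_vectors` and `query_vectors` are `float32` NumPy arrays (swap in `IndexFlatIP` if you use inner-product similarity):

```python
import faiss
import numpy as np

K = 10  # number of true neighbors to keep per query

# IndexFlatL2 performs an exact, brute-force L2 search: no approximation.
exact_index = faiss.IndexFlatL2(data_vectors.shape[1])
exact_index.add(data_vectors)

# search returns (distances, indices); we keep the indices of the true top-K.
_, ground_truth_indices = exact_index.search(query_vectors, K)  # shape (Q, K)
```

This exact search can be slow for large `N`, but it only has to run once; you can cache the result to disk (e.g., with `np.save`) and reload it for later evaluation runs.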
## Step 1: Setting Up the Index

First, we need to build the HNSW index using our chosen library. Here's an example using Faiss:

```python
import faiss
import numpy as np
import time

# Assume data_vectors is loaded, shape (N, D)
N, D = data_vectors.shape

# HNSW parameters
M = 32                # Number of connections per node
efConstruction = 100  # Construction-time quality/speed trade-off

# Build the HNSW index
index = faiss.IndexHNSWFlat(D, M)
index.hnsw.efConstruction = efConstruction

print("Building index...")
start_time = time.time()
index.add(data_vectors)
build_time = time.time() - start_time
print(f"Index built in {build_time:.2f} seconds.")

# Note: for production, you'd typically save/load the index:
# faiss.write_index(index, "my_hnsw_index.faiss")
# index = faiss.read_index("my_hnsw_index.faiss")
```

This snippet initializes an HNSW index for vectors of dimension `D` with `M` connections per node. `efConstruction` influences the quality of the graph and the time taken to build it.

## Step 2: Defining the Evaluation Loop

Now we'll loop through different values of the search-time parameter `efSearch`, which controls the size of the dynamic candidate list HNSW maintains during the search phase. Higher values generally lead to better recall but increased latency. We'll measure:

- **Recall@K:** The proportion of the true top-K neighbors found among the K results returned by the ANN search.
- **Average query latency:** The average time taken to perform a single search.
- **Queries per second (QPS):** The number of queries processed per second.

```python
import pandas as pd

# Assume query_vectors and ground_truth_indices are loaded;
# ground_truth_indices holds the indices of the true top-K neighbors.
Q = query_vectors.shape[0]
K = ground_truth_indices.shape[1]  # Number of neighbors to evaluate (e.g., 10)

efSearch_values = [16, 32, 64, 128, 256]  # Example values to test
results = []

for efSearch in efSearch_values:
    print(f"\nEvaluating with efSearch = {efSearch}...")
    index.hnsw.efSearch = efSearch  # Set the search-time parameter

    start_eval_time = time.time()
    # Batch-search all queries for their K nearest neighbors
    D_ann, I_ann = index.search(query_vectors, K)
    end_eval_time = time.time()

    total_eval_time = end_eval_time - start_eval_time
    avg_latency_ms = (total_eval_time / Q) * 1000
    qps = Q / total_eval_time

    # Calculate Recall@K
    total_matches = 0
    for i in range(Q):
        true_neighbors = set(ground_truth_indices[i])  # ground truth for query i
        retrieved_neighbors = set(I_ann[i])            # ANN results for query i
        total_matches += len(true_neighbors.intersection(retrieved_neighbors))

    # Average recall across all queries
    recall_at_K = total_matches / (Q * K)

    print(f"  Recall@{K}: {recall_at_K:.4f}")
    print(f"  Avg Latency: {avg_latency_ms:.2f} ms")
    print(f"  QPS: {qps:.2f}")

    results.append({
        "efSearch": efSearch,
        "Recall@K": recall_at_K,
        "Avg Latency (ms)": avg_latency_ms,
        "QPS": qps,
    })

# Store results for analysis
results_df = pd.DataFrame(results)
print("\nEvaluation Summary:")
print(results_df)
```

The formula for Recall@K used here is:

$$
\text{Recall@K} = \frac{1}{Q} \sum_{i=1}^{Q} \frac{|\text{ANN\_Results}_i \cap \text{GroundTruth}_i|}{K}
$$

where $\text{ANN\_Results}_i$ is the set of $K$ indices returned by the ANN search for query $i$, and $\text{GroundTruth}_i$ is the set of $K$ true nearest-neighbor indices for query $i$.

## Step 3: Analyzing the Trade-off

The `results_df` DataFrame now holds the performance metrics for each `efSearch` value tested. A common way to visualize the trade-off is to plot Recall@K against average latency or QPS, as in the sketch below.
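The charts in this exercise were rendered interactively; as a minimal sketch, assuming the `results_df` produced above, you could reproduce the first one locally with matplotlib (any plotting library works):

```python
import matplotlib.pyplot as plt

# Recall on the left axis, latency on the right, both against efSearch.
fig, ax1 = plt.subplots()
ax1.plot(results_df["efSearch"], results_df["Recall@K"], "o-", color="tab:blue")
ax1.set_xlabel("efSearch")
ax1.set_ylabel(f"Recall@{K}", color="tab:blue")

ax2 = ax1.twinx()  # second y-axis sharing the same x-axis
ax2.plot(results_df["efSearch"], results_df["Avg Latency (ms)"], "s-",
         color="tab:orange")
ax2.set_ylabel("Avg Latency (ms)", color="tab:orange")

plt.title(f"HNSW Performance: Recall@{K} vs. Latency")
plt.show()
```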
*Figure: Recall@10 and average query latency plotted against the HNSW `efSearch` parameter.* This visualization clearly shows the trade-off: higher `efSearch` increases recall but also increases latency.

Alternatively, you can plot Recall@K against QPS:

*Figure: Recall@10 and queries per second (QPS) plotted against the HNSW `efSearch` parameter.* Increasing `efSearch` improves recall but decreases the system's throughput.

## Interpretation and Next Steps

The charts generated from this evaluation allow you to make informed decisions about parameter tuning:

- **High recall needed:** If your application (like certain RAG scenarios) requires very high recall, you might choose a higher `efSearch` value, accepting the associated increase in latency or decrease in QPS.
- **Low latency critical:** If low latency is key (e.g., real-time semantic search suggestions), you might select a lower `efSearch` value, trading off some recall for faster responses.

Systematic evaluation is fundamental to building effective and efficient vector search systems. By applying these techniques, you can confidently tune your system to meet the specific performance and relevance requirements of your LLM applications.

This practical exercise provides a template. You can extend it by:

- Evaluating different ANN algorithms (IVF, PQ variations); see the sketch after this list.
- Testing different parameter combinations (e.g., `M` and `efConstruction` for HNSW; `nlist` and `nprobe` for IVF).
- Measuring index build time and memory usage.
- Incorporating precision or other relevance metrics where appropriate.
- Running evaluations on different hardware (CPU vs. GPU) to understand the impact of hardware acceleration.
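As a concrete starting point for the first two extensions, here is a minimal sketch of the same sweep applied to an IVF index, where `nprobe` (the number of inverted lists scanned at query time) plays a role analogous to `efSearch`. The parameter values are illustrative assumptions, not tuned recommendations:

```python
import faiss

nlist = 1024  # number of coarse clusters (illustrative value)
quantizer = faiss.IndexFlatL2(D)
ivf_index = faiss.IndexIVFFlat(quantizer, D, nlist)

# IVF indexes must be trained (to learn the clusters) before adding vectors.
ivf_index.train(data_vectors)
ivf_index.add(data_vectors)

# Sweep nprobe exactly as we swept efSearch above.
for nprobe in [1, 4, 16, 64]:
    ivf_index.nprobe = nprobe
    D_ann, I_ann = ivf_index.search(query_vectors, K)
    # ...compute Recall@K, latency, and QPS as in Step 2...
```

Because the evaluation loop and metrics are unchanged, you can plot the resulting recall/latency frontiers of HNSW and IVF on the same axes and compare them directly.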