Having explored the theoretical underpinnings of HNSW, including its layered graph structure and greedy search heuristic, it's time to translate that theory into practice. This section guides you through implementing an HNSW index using a common Python library, building an index, performing searches, and tuning its parameters to balance performance and accuracy for your specific needs. We assume you have a working Python environment and are familiar with NumPy for numerical operations.Setting Up the EnvironmentFirst, ensure you have the necessary library installed. We'll use hnswlib, a popular and efficient implementation of HNSW. You'll also need numpy for creating vector data.pip install hnswlib numpyWe'll work with synthetic data for this example. Let's generate some random 128-dimensional vectors, simulating typical embedding outputs from language models.import hnswlib import numpy as np import time # Define data parameters dim = 128 # Dimensionality of the vectors num_elements = 10000 # Number of vectors in our dataset # Generate random data (replace with your actual embeddings) np.random.seed(42) # for reproducibility data = np.float32(np.random.random((num_elements, dim))) # Generate unique IDs for each vector data_labels = np.arange(num_elements) # Generate a few query vectors num_queries = 5 query_data = np.float32(np.random.random((num_queries, dim)))Building the HNSW IndexInitializing and populating an HNSW index involves specifying the space (distance metric), dimensionality, and then adding the data points.# Initialize the HNSW index # Possible space options: 'l2', 'ip' (inner product), 'cosine' space_name = 'l2' # Euclidean distance index = hnswlib.Index(space=space_name, dim=dim) # Set index parameters BEFORE adding data # M: max number of connections per node per layer (default 16) # ef_construction: size of dynamic list for neighbors during construction (default 200) # Higher ef_construction leads to better index quality but slower build time M = 16 ef_construction = 200 index.init_index(max_elements=num_elements, ef_construction=ef_construction, M=M) # Add data to the index # ids should be unique integers print("Adding data to the index...") start_time = time.time() index.add_items(data, data_labels) build_time = time.time() - start_time print(f"Index build completed in {build_time:.2f} seconds.") # Optional: Control the number of threads used for indexing # index.set_num_threads(4) # Use 4 threadsHere, init_index prepares the index structure. max_elements should be at least the number of elements you intend to add. M controls the density of the graph connections, affecting memory usage and search quality. ef_construction influences the thoroughness of the search for neighbours during the build process; higher values generally lead to better recall but increase indexing time.Performing Approximate Nearest Neighbor SearchesOnce the index is built, you can perform searches using the knn_query method.# Set search-time parameter ef_search # ef_search: size of dynamic list for neighbors during search # Higher ef_search leads to better recall but slower search time # Must be >= k (number of neighbors requested) ef_search = 50 k = 10 # Number of nearest neighbors to retrieve print(f"\nPerforming k-NN search for {num_queries} queries (k={k}, ef_search={ef_search})...") index.set_ef(ef_search) # Set ef_search for the query session start_time = time.time() labels, distances = index.knn_query(query_data, k=k) search_time = time.time() - start_time print(f"Search completed in {search_time:.4f} seconds.") print(f"Average query time: {search_time / num_queries:.4f} seconds.") # Display results for the first query print("\nResults for the first query:") print("Labels:", labels[0]) print("Distances:", distances[0])The knn_query method returns two arrays: labels containing the integer IDs of the approximate nearest neighbors, and distances containing the corresponding calculated distances (e.g., L2 distance in our case). The critical parameter here is ef_search (set via set_ef). It controls the size of the candidate list during the search traversal. Increasing ef_search makes the search more exhaustive, potentially improving accuracy (recall) at the cost of higher latency. It must be at least k.Tuning HNSW Parameters: The Recall/Speed Trade-offThe core challenge in using ANN algorithms like HNSW is balancing accuracy (recall) and performance (query latency, build time, memory usage). The primary tuning knobs are:M (Build time): Defines the maximum number of outgoing connections for nodes in the graph layers (except layer 0). Higher M creates denser graphs, potentially improving recall and allowing for better navigation, but significantly increases memory footprint and build time. Typical values range from 8 to 64.ef_construction (Build time): Controls the quality of the index construction. Higher values mean a more thorough search for neighbours when inserting elements, leading to a potentially better-structured graph (higher recall) but longer build times. Typical values range from 100 to 2000 or more, depending on the dataset and desired quality.ef_search (Search time): Controls the trade-off between search speed and recall at query time. It determines the size of the priority queue used during graph traversal. Higher values increase the chance of finding the true nearest neighbors but slow down the query. This is often the most frequently tuned parameter after the index is built. It must be at least k.A Practical Tuning Strategy:Establish Ground Truth: For rigorous tuning, you need ground truth: the actual $k$ nearest neighbors for your query set. This usually requires running an exact, brute-force search. This is feasible for smaller datasets or representative samples.# Example: Finding exact neighbors (computationally expensive!) # from sklearn.neighbors import NearestNeighbors # print("Calculating ground truth (this might take a while)...") # nbrs = NearestNeighbors(n_neighbors=k, algorithm='brute', metric='euclidean').fit(data) # true_distances, true_indices = nbrs.kneighbors(query_data)Define Metrics: Measure:Recall@k: The proportion of true $k$ nearest neighbors found by HNSW among its top $k$ results, averaged over all queries. $$ Recall@k = \frac{1}{N_q} \sum_{i=1}^{N_q} \frac{| \text{ANN}_k(q_i) \cap \text{True}_k(q_i) |}{k} $$ Where $N_q$ is the number of queries, $\text{ANN}_k(q_i)$ are the $k$ neighbors returned by HNSW for query $q_i$, and $\text{True}_k(q_i)$ are the true $k$ neighbors.Query Latency: Average time taken per query.Index Size: Memory required to store the index.Build Time: Time taken to construct the index.Iterative Tuning:Fix M and ef_construction (start with defaults like M=16, ef_construction=200). Build the index.Vary ef_search (e.g., from $k$ up to a few hundred). For each value, measure Recall@k and average query latency.Plot Recall@k vs. Latency. This curve clearly shows the trade-off for the current index configuration.If the desired recall cannot be reached even with high ef_search, or if the latency is too high, rebuild the index with increased ef_construction or M. Be mindful that increasing these parameters increases build time and potentially index size (especially M).Repeat the ef_search sweep and evaluation.Example Visualization:Let's simulate results for different ef_search values and plot the trade-off. Assume we have calculated recall against some ground truth.# Simulated tuning results (replace with actual measurements) ef_values = [10, 20, 50, 100, 200, 400] recalls = [0.75, 0.85, 0.92, 0.96, 0.98, 0.99] # Example recall values latencies_ms = [0.5, 0.8, 1.5, 2.8, 5.0, 9.5] # Example latencies in milliseconds import json chart_data = { "layout": { "title": "HNSW Tuning: Recall@10 vs. Query Latency", "xaxis": {"title": "Average Query Latency (ms)"}, "yaxis": {"title": "Recall@10", "range": [0.7, 1.0]}, "legend": {"title": "ef_search"}, "margin": {"l": 50, "r": 50, "t": 50, "b": 50}, "width": 600, "height": 400 }, "data": [ { "x": latencies_ms, "y": recalls, "mode": "lines+markers", "type": "scatter", "name": "Recall", "text": [f"ef={ef}" for ef in ef_values], # Text displayed on hover "marker": {"color": "#339af0", "size": 8}, "line": {"color": "#339af0", "width": 2} } ] } # Display chart (embedding JSON in markdown) print("```plotly") print(json.dumps(chart_data)) print("```"){"layout": {"title": "HNSW Tuning: Recall@10 vs. Query Latency", "xaxis": {"title": "Average Query Latency (ms)"}, "yaxis": {"title": "Recall@10", "range": [0.7, 1.0]}, "legend": {"title": "ef_search"}, "margin": {"l": 50, "r": 50, "t": 50, "b": 50}, "width": 600, "height": 400}, "data": [{"x": [0.5, 0.8, 1.5, 2.8, 5.0, 9.5], "y": [0.75, 0.85, 0.92, 0.96, 0.98, 0.99], "mode": "lines+markers", "type": "scatter", "name": "Recall", "text": ["ef=10", "ef=20", "ef=50", "ef=100", "ef=200", "ef=400"], "marker": {"color": "#339af0", "size": 8}, "line": {"color": "#339af0", "width": 2}}]}The plot illustrates how increasing ef_search typically improves recall but also increases query latency. The optimal ef_search depends on the application's specific requirements for accuracy and speed.Saving and Loading the IndexFor production use, you'll want to save your tuned index and load it without rebuilding.# Save the index to disk index_path = 'my_hnsw_index.bin' print(f"\nSaving index to {index_path}...") index.save_index(index_path) print("Index saved.") # Load the index (in a new session or script) # Need to know dim and space_name used during creation loaded_index = hnswlib.Index(space=space_name, dim=dim) print(f"\nLoading index from {index_path}...") loaded_index.load_index(index_path) print("Index loaded.") # You can now set ef and query the loaded index loaded_index.set_ef(ef_search) labels_loaded, distances_loaded = loaded_index.knn_query(query_data, k=k) # Verify results are the same assert np.array_equal(labels, labels_loaded), "Loaded index results differ!" print("Loaded index query successful and matches original.")This practical exercise demonstrates the fundamental steps of using HNSW: initialization, building, querying, and the critical process of tuning parameters like M, ef_construction, and especially ef_search to achieve the desired balance between search accuracy and performance efficiency for your vector search application. Remember that optimal parameters are dataset-dependent and require careful evaluation based on your specific use case requirements.