Quantization techniques, such as Product Quantization (PQ), compress vectors to save memory and accelerate search. Metadata filtering narrows the search space to improve relevance and, potentially, speed. This exercise provides a hands-on application of these techniques in a simulated environment. The goal is to demonstrate how these optimizations interact and affect search outcomes.

We'll assume you have a development environment with libraries like `numpy` for numerical operations and a vector search library such as `faiss`. While specific code varies between libraries, the principles demonstrated here are broadly applicable.

## Scenario Setup

Imagine we're building a semantic search system for a corpus of technical articles. Each article has a vector embedding representing its content and associated metadata, including a publication year and a primary category (e.g., 'AI', 'Data Science', 'Software Engineering').

Our goal is to find articles semantically similar to a query vector, but only within a specific category and published in or after a certain year. We need this search to be fast and memory-efficient.

Let's simulate some data:

```python
import numpy as np
import faiss
import time

# --- Configuration ---
d = 128           # Vector dimension
n_total = 100000  # Total number of vectors in the index
n_query = 10      # Number of query vectors
k = 5             # Number of nearest neighbors to retrieve

# --- Generate Sample Data ---
print("Generating sample data...")
np.random.seed(42)
db_vectors = np.random.random((n_total, d)).astype('float32')
query_vectors = np.random.random((n_query, d)).astype('float32')

# Generate random metadata (example)
categories = ['AI', 'Data Science', 'Software Engineering', 'Cloud Computing']
years = [2020, 2021, 2022, 2023]
metadata = [
    {
        'id': i,
        'category': np.random.choice(categories),
        'year': np.random.choice(years)
    }
    for i in range(n_total)
]

# Create a mapping for quick metadata lookup by ID
metadata_map = {item['id']: item for item in metadata}

print(f"Generated {n_total} vectors of dimension {d}.")
print(f"Sample metadata[0]: {metadata[0]}")

# --- Target Filter Criteria (Example) ---
target_category = 'AI'
target_min_year = 2022
```

## Baseline: Exhaustive Search (No Optimization)

First, let's establish a baseline using an exact search index (`IndexFlatL2` in Faiss). This provides perfect recall but can be slow and memory-intensive for large datasets.

```python
# --- Baseline: Flat Index (Exact Search) ---
print("\n--- Baseline: Flat Index ---")
index_flat = faiss.IndexFlatL2(d)
index_flat.add(db_vectors)

start_time = time.time()
D_flat, I_flat = index_flat.search(query_vectors, k)
end_time = time.time()

print(f"Flat index search time: {end_time - start_time:.4f} seconds")
print(f"Sample results (indices): {I_flat[0]}")

# Note: This baseline doesn't include filtering yet.
# Filtering would happen *after* retrieving these k results (post-filtering).
```
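The flat index is also the memory-hungry option: it stores every vector as raw `float32` values, whereas a PQ-compressed index keeps only a short code per vector. The back-of-envelope estimate below is a sketch that assumes the `M = 8`, `nbits = 8` configuration chosen in the next section and ignores auxiliary index overhead (coarse centroids, list pointers), so treat the numbers as estimates rather than measurements.

```python
# Rough memory estimate: raw float32 vectors vs. PQ codes (ignores index overhead).
# M_est/nbits_est mirror the PQ configuration used in the next section.
M_est, nbits_est = 8, 8

flat_bytes = n_total * d * 4                                  # raw float32 vectors
pq_code_bytes = n_total * (M_est * nbits_est // 8)            # 8-byte PQ code per vector
pq_id_bytes = n_total * 8                                     # int64 id stored with each code
codebook_bytes = M_est * (2 ** nbits_est) * (d // M_est) * 4  # PQ codebook centroids (float32)

print(f"Flat storage:  {flat_bytes / 1e6:.1f} MB")
print(f"IVFPQ storage: {(pq_code_bytes + pq_id_bytes + codebook_bytes) / 1e6:.1f} MB (approx.)")
```

For this toy corpus that works out to roughly 51 MB of raw vectors versus under 2 MB of codes, and the gap grows proportionally with the collection size.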
## Applying Product Quantization (PQ)

Now, let's apply PQ to compress vectors and speed up the search. We'll use an `IndexIVFPQ` structure, which combines Inverted Files (IVF) for partitioning with PQ for compression within each partition.

Important parameters for `IndexIVFPQ`:

- `nlist`: Number of IVF partitions (Voronoi cells). A common starting point is `sqrt(n_total)`.
- `M`: Number of sub-quantizers for PQ. Each vector is split into `M` sub-vectors, so `M` must be a divisor of `d`.
- `nbits`: Number of bits per sub-quantizer code. Usually 8 (resulting in 256 centroids per sub-space).

```python
# --- Applying Quantization: IVFPQ ---
print("\n--- Optimization 1: IVFPQ Index ---")

# Parameters
nlist = int(np.sqrt(n_total))  # Number of clusters for IVF
M = 8                          # Number of sub-quantizers for PQ (d=128 is divisible by M=8)
nbits = 8                      # Bits per sub-vector index

# Build the quantizer (used for IVF clustering) and the index
quantizer = faiss.IndexFlatL2(d)
index_ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, M, nbits)

# Train the index on the database vectors
print(f"Training IVFPQ index (nlist={nlist}, M={M}, nbits={nbits})...")
# Training requires a representative sample, often the data itself if feasible.
# For very large datasets, use a subset for training.
index_ivfpq.train(db_vectors)
print("Training complete.")

# Add vectors to the index
print("Adding vectors to IVFPQ index...")
index_ivfpq.add(db_vectors)
print("Adding complete.")

# --- Perform Search with IVFPQ ---
# 'nprobe' controls how many IVF partitions to search. Higher = more accurate, slower.
index_ivfpq.nprobe = 10  # Example value, needs tuning

start_time = time.time()
D_ivfpq, I_ivfpq = index_ivfpq.search(query_vectors, k)
end_time = time.time()

print(f"IVFPQ search time (nprobe={index_ivfpq.nprobe}): {end_time - start_time:.4f} seconds")
print(f"Sample results (indices): {I_ivfpq[0]}")

# Calculate recall by comparing against the Flat results for the first query.
# Note: the Flat search is exact, so its results serve as ground truth here;
# this is a simplified check because it only looks at a single query.
common_results = np.intersect1d(I_flat[0], I_ivfpq[0])
recall_approx = len(common_results) / k
print(f"Approximate recall@{k} for first query (vs Flat): {recall_approx:.2f}")
```

You should observe a significant speedup compared to the flat index, potentially at the cost of some recall. Tuning `nprobe` allows you to trade speed for accuracy.
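To see that trade-off concretely, we can sweep `nprobe` and measure recall against the exact Flat results, which serve as ground truth for our queries. This is an illustrative sketch rather than a rigorous benchmark; the recall figures simply average over all `n_query` queries.

```python
# Illustrative nprobe sweep: higher nprobe -> better recall, slower search.
for nprobe in [1, 5, 10, 30, 100]:
    index_ivfpq.nprobe = nprobe

    start = time.time()
    _, I_sweep = index_ivfpq.search(query_vectors, k)
    elapsed = time.time() - start

    # Average recall@k over all queries, measured against the exact Flat results.
    recalls = [
        len(np.intersect1d(I_flat[q], I_sweep[q])) / k
        for q in range(n_query)
    ]
    print(f"nprobe={nprobe:>3}: recall@{k}={np.mean(recalls):.2f}, search time={elapsed:.4f}s")
```

On a small synthetic dataset the absolute timings are noisy, but the direction of the trend should be visible: recall climbs with `nprobe` while latency grows.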
## Implementing Filtering Strategies

Now, let's incorporate our metadata filters: `category == 'AI'` and `year >= 2022`.

### Strategy 1: Post-filtering

This is the simplest approach. We perform the ANN search first (potentially retrieving more than `k` results to compensate for filtering losses) and then filter the results based on metadata.

```python
# --- Optimization 2a: IVFPQ + Post-filtering ---
print("\n--- Optimization 2a: IVFPQ + Post-filtering ---")

k_oversample = k * 5     # Retrieve more results initially to account for filtering
index_ivfpq.nprobe = 10  # Reset nprobe if needed

start_time_search = time.time()
D_oversample, I_oversample = index_ivfpq.search(query_vectors, k_oversample)
end_time_search = time.time()

# --- Filtering Step ---
start_time_filter = time.time()
final_results_post = []
final_distances_post = []

for i in range(n_query):  # For each query
    filtered_indices = []
    filtered_distances = []
    for j in range(k_oversample):
        doc_id = I_oversample[i, j]
        if doc_id < 0:
            continue  # Faiss might return -1 for invalid results
        meta = metadata_map.get(doc_id)
        if meta and meta['category'] == target_category and meta['year'] >= target_min_year:
            filtered_indices.append(doc_id)
            filtered_distances.append(D_oversample[i, j])
            if len(filtered_indices) == k:  # Stop once we have enough results
                break
    final_results_post.append(filtered_indices)
    final_distances_post.append(filtered_distances)

end_time_filter = time.time()

total_time = (end_time_search - start_time_search) + (end_time_filter - start_time_filter)
print(f"IVFPQ search time: {end_time_search - start_time_search:.4f} seconds")
print(f"Post-filtering time: {end_time_filter - start_time_filter:.4f} seconds")
print(f"Total time (search + post-filter): {total_time:.4f} seconds")
print(f"Sample filtered results (indices): {final_results_post[0]}")
```

Post-filtering is straightforward but performs potentially unnecessary work by fetching and checking metadata for candidates that might have been excluded earlier. The overhead increases with `k_oversample`.
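A related wrinkle: if the filter turns out to be stricter than expected, a fixed `k_oversample` may leave us with fewer than `k` results after filtering. One common remedy is to retry with a progressively larger candidate pool. The helper below is a hypothetical sketch of that idea (the function name and retry policy are ours, not part of Faiss), reusing the metadata and index defined above.

```python
def post_filtered_search(index, query, k, max_oversample=200):
    """Sketch: retry the ANN search with a growing candidate pool until k filtered hits are found.

    Re-runs the search from scratch on each retry; simple, but not the most
    efficient strategy when the candidate pool gets large.
    """
    oversample = k * 5
    while True:
        D, I = index.search(query.reshape(1, -1), oversample)
        hits = []
        for dist, doc_id in zip(D[0], I[0]):
            if doc_id < 0:
                continue  # padding entry, no more candidates
            meta = metadata_map.get(int(doc_id))
            if meta and meta['category'] == target_category and meta['year'] >= target_min_year:
                hits.append((dist, int(doc_id)))
                if len(hits) == k:
                    return hits
        if oversample >= max_oversample:
            return hits  # fewer than k candidates satisfy the filter
        oversample = min(oversample * 2, max_oversample)

# Example: adaptive post-filtered search for the first query vector.
print(post_filtered_search(index_ivfpq, query_vectors[0], k))
```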
### Strategy 2: Pre-filtering (Simulated)

Efficient pre-filtering often requires integration with the indexing structure itself or the underlying database. Vector search libraries like Faiss offer mechanisms such as `IDSelector` that restrict the search to a specified subset of IDs. We can simulate the effect of pre-filtering by first identifying the document IDs that match our metadata criteria and then instructing the search library (if possible) to consider only these IDs.

```python
# --- Optimization 2b: IVFPQ + Pre-filtering (Simulated with IDSelector) ---
print("\n--- Optimization 2b: IVFPQ + Pre-filtering (Simulated) ---")

start_time_id_select = time.time()

# Step 1: Identify IDs matching the metadata filter
eligible_ids_list = [
    item['id'] for item in metadata
    if item['category'] == target_category and item['year'] >= target_min_year
]
eligible_ids = np.array(eligible_ids_list, dtype='int64')  # Faiss often requires int64 IDs

# If the number of eligible IDs is very small, exact search might be faster
if len(eligible_ids) == 0:
    print("No documents match the filter criteria.")
    # Handle empty result case
elif len(eligible_ids) < 1000:  # Threshold depends on system/data
    print(f"Found {len(eligible_ids)} matching IDs. Potentially use exact search on subset.")
    # Implementation for exact search on subset omitted for brevity (see the sketch below).
else:
    print(f"Found {len(eligible_ids)} matching IDs. Using IDSelector with IVFPQ.")

    # Step 2: Use Faiss IDSelector
    id_selector = faiss.IDSelectorArray(eligible_ids)

    # Step 3: Perform search with the selector.
    # Note: the `search` call needs the selector passed via a SearchParameters object:
    # index.search(query, k, params=...). This is a simplified call structure for
    # illustration; the exact API may differ between Faiss versions and index types.
    search_params = faiss.SearchParametersIVF(sel=id_selector, nprobe=index_ivfpq.nprobe)
    end_time_id_select = time.time()

    start_time_search = time.time()
    D_pre, I_pre = index_ivfpq.search(query_vectors, k, params=search_params)
    end_time_search = time.time()

    total_time = (end_time_id_select - start_time_id_select) + (end_time_search - start_time_search)
    print(f"ID selection time: {end_time_id_select - start_time_id_select:.4f} seconds")
    print(f"IVFPQ search time (with pre-filter): {end_time_search - start_time_search:.4f} seconds")
    print(f"Total time (ID select + search): {total_time:.4f} seconds")
    print(f"Sample filtered results (indices): {I_pre[0]}")

# Caveat: the performance benefit of pre-filtering depends heavily on the selectivity
# of the filter and on the efficiency of the IDSelector implementation within the index.
# If the filter selects a large fraction of the dataset, the benefit might be small.
```

Pre-filtering, when implemented efficiently within the index, can significantly reduce search time by avoiding computations for irrelevant vectors. However, it requires the index structure or database to support this capability effectively. The initial cost of identifying eligible IDs also needs consideration.
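For completeness, here is one way the "exact search on subset" branch flagged above could look: when only a few hundred IDs pass the filter, brute-forcing squared L2 distances against just those vectors is exact and usually fast enough. The helper name below is ours, and it uses plain `numpy` rather than a dedicated Faiss structure; treat it as a sketch of one possible implementation.

```python
# Exact search restricted to vectors that pass the metadata filter (numpy brute force).
def exact_search_on_subset(query, subset_ids, k):
    subset = db_vectors[subset_ids]  # (n_eligible, d)
    # Squared L2 distances, matching the convention IndexFlatL2 uses.
    dists = np.sum((subset - query) ** 2, axis=1)
    order = np.argsort(dists)[:k]
    return dists[order], subset_ids[order]

# Example: exact filtered search for the first query vector.
if len(eligible_ids) > 0:
    D_exact, I_exact = exact_search_on_subset(query_vectors[0], eligible_ids, k)
    print(f"Exact filtered results (indices): {I_exact}")
```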
## Performance Comparison

Let's compare the trade-offs we observed. The table below lists approximate latency against recall (or a proxy for it) for the different approaches.

**Latency vs. recall trade-off (illustrative)**

| Method                          | Approximate recall | Latency (ms) |
|---------------------------------|--------------------|--------------|
| Flat Index                      | 1.00               | 150          |
| IVFPQ (nprobe=10)               | 0.85               | 15           |
| IVFPQ (nprobe=30)               | 0.95               | 45           |
| IVFPQ (nprobe=10) + post-filter | 0.85               | 18           |
| IVFPQ (nprobe=10) + pre-filter  | 0.85               | 12           |

Illustrative comparison of different optimization strategies. Values are examples; actual performance depends heavily on data, hardware, and parameter tuning. The pre-filtering row assumes an efficient implementation.

This comparison highlights the typical behavior:

- Flat Index: highest recall (1.0 by definition), highest latency.
- IVFPQ: lower latency but reduced recall. Increasing `nprobe` improves recall but increases latency.
- Post-filtering: adds a small latency overhead to the base IVFPQ search for the filtering step. Recall depends on the base search recall and `k_oversample`.
- Pre-filtering: can potentially offer the lowest latency for the same recall level as the corresponding IVFPQ search, if the filter is selective and the implementation is efficient.

## Conclusion

This practice session demonstrated how quantization (specifically IVFPQ) drastically reduces search latency compared to exhaustive search, albeit with an accuracy trade-off controlled by parameters like `nprobe`. We also implemented and contrasted post-filtering and (simulated) pre-filtering. Post-filtering is simple but potentially inefficient, while pre-filtering, when supported effectively, can prune the search space early, leading to better performance, especially with selective filters.

Choosing the right combination involves understanding your application's specific requirements for speed, memory usage, and accuracy, and carefully tuning parameters like `nprobe`, `M`, `nbits`, and `k_oversample` based on empirical evaluation. The techniques covered here are foundational for building responsive and scalable LLM applications that rely on vector search.