Alright, let's transition from theory to practice. We've discussed how quantization techniques like Product Quantization (PQ) compress vectors to save memory and accelerate search, and how metadata filtering narrows down the search space for increased relevance and potentially speed. Now, we'll work through applying these techniques in a simulated environment. This hands-on exercise aims to solidify your understanding of how these optimizations interact and affect search outcomes.
We'll assume you have a development environment with libraries like numpy for numerical operations and a vector search library such as faiss. While specific code varies between libraries, the principles demonstrated here are broadly applicable.
Imagine we're building a semantic search system for a corpus of technical articles. Each article has a vector embedding representing its content and associated metadata, including a publication year and a primary category (e.g., 'AI', 'Data Science', 'Software Engineering').
Our goal is to find articles semantically similar to a query vector, but only within a specific category and published after a certain year. We need this search to be fast and memory-efficient.
Let's simulate some data:
import numpy as np
import faiss
import time
# --- Configuration ---
d = 128 # Vector dimension
n_total = 100000 # Total number of vectors in the index
n_query = 10 # Number of query vectors
k = 5 # Number of nearest neighbors to retrieve
# --- Generate Sample Data ---
print("Generating sample data...")
np.random.seed(42)
db_vectors = np.random.random((n_total, d)).astype('float32')
query_vectors = np.random.random((n_query, d)).astype('float32')
# Generate random metadata (example)
categories = ['AI', 'Data Science', 'Software Engineering', 'Cloud Computing']
years = [2020, 2021, 2022, 2023]
metadata = [
    {
        'id': i,
        'category': np.random.choice(categories),
        'year': np.random.choice(years)
    } for i in range(n_total)
]
# Create a mapping for quick metadata lookup by ID
metadata_map = {item['id']: item for item in metadata}
print(f"Generated {n_total} vectors of dimension {d}.")
print(f"Sample metadata[0]: {metadata[0]}")
# --- Target Filter Criteria (Example) ---
target_category = 'AI'
target_min_year = 2022
First, let's establish a baseline using an exact search index (IndexFlatL2 in Faiss). This provides perfect recall but can be slow and memory-intensive for large datasets.
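To put "memory-intensive" into perspective, here is a quick back-of-the-envelope estimate of the flat index footprint for the configuration above (simple arithmetic, not a Faiss measurement):
# --- Rough storage estimate for IndexFlatL2 ---
# The flat index stores every vector verbatim as float32 (4 bytes per component)
flat_index_bytes = n_total * d * 4
print(f"Flat index storage: ~{flat_index_bytes / 1e6:.1f} MB for {n_total:,} vectors")
# The same corpus at 100 million vectors would need roughly 51 GB, before any query-time work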
# --- Baseline: Flat Index (Exact Search) ---
print("\n--- Baseline: Flat Index ---")
index_flat = faiss.IndexFlatL2(d)
index_flat.add(db_vectors)
start_time = time.time()
D_flat, I_flat = index_flat.search(query_vectors, k)
end_time = time.time()
print(f"Flat index search time: {end_time - start_time:.4f} seconds")
print(f"Sample results (indices): {I_flat[0]}")
# Note: This baseline doesn't include filtering yet.
# Filtering would happen *after* retrieving these K results (post-filtering).
Now, let's apply PQ to compress vectors and speed up the search. We'll use an IndexIVFPQ structure, which combines Inverted Files (IVF) for partitioning with PQ for compression within each partition.
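Before building the index, it's worth a quick look at what PQ buys in memory. With the parameters used below (M = 8 sub-quantizers at 8 bits each), each 512-byte float32 vector is encoded as an 8-byte code. A rough comparison, ignoring IVF lists, IDs, and codebook overhead:
# --- Back-of-the-envelope PQ compression estimate ---
raw_bytes_per_vector = d * 4       # 128 float32 components = 512 bytes
pq_bytes_per_vector = 8 * 8 // 8   # M=8 sub-quantizer codes of 8 bits each = 8 bytes
print(f"Raw vectors: ~{n_total * raw_bytes_per_vector / 1e6:.1f} MB")
print(f"PQ codes:    ~{n_total * pq_bytes_per_vector / 1e6:.1f} MB")
print(f"Compression: {raw_bytes_per_vector // pq_bytes_per_vector}x per vector")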
Key parameters for IndexIVFPQ:
- nlist: Number of IVF partitions (Voronoi cells). A common starting point is sqrt(n_total).
- M: Number of sub-quantizers for PQ; each vector is split into M sub-vectors, so M must be a divisor of d.
- nbits: Number of bits per sub-quantizer code. Usually 8 (resulting in 256 centroids per sub-space).
# --- Applying Quantization: IVFPQ ---
print("\n--- Optimization 1: IVFPQ Index ---")
# Parameters
nlist = int(np.sqrt(n_total)) # Number of clusters for IVF
M = 8 # Number of sub-quantizers for PQ (d=128 is divisible by M=8)
nbits = 8 # Bits per sub-vector index
# Build the quantizer (used for IVF clustering) and the index
quantizer = faiss.IndexFlatL2(d)
index_ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, M, nbits)
# Train the index on the database vectors
print(f"Training IVFPQ index (nlist={nlist}, M={M}, nbits={nbits})...")
# Training requires a representative sample, often the data itself if feasible
# For very large datasets, use a subset for training.
index_ivfpq.train(db_vectors)
print("Training complete.")
# Add vectors to the index
print("Adding vectors to IVFPQ index...")
index_ivfpq.add(db_vectors)
print("Adding complete.")
# --- Perform Search with IVFPQ ---
# 'nprobe' controls how many IVF partitions to search. Higher = more accurate, slower.
index_ivfpq.nprobe = 10 # Example value, needs tuning
start_time = time.time()
D_ivfpq, I_ivfpq = index_ivfpq.search(query_vectors, k)
end_time = time.time()
print(f"IVFPQ search time (nprobe={index_ivfpq.nprobe}): {end_time - start_time:.4f} seconds")
print(f"Sample results (indices): {I_ivfpq[0]}")
# Calculate recall (approximate, comparing against Flat results for the first query)
# Note: Proper recall requires ground truth from the *entire* dataset,
# which is computationally expensive. This is a simplified check.
common_results = np.intersect1d(I_flat[0], I_ivfpq[0])
recall_approx = len(common_results) / k
print(f"Approximate recall@ {k} for first query (vs Flat): {recall_approx:.2f}")
You should observe a significant speedup compared to the flat index, potentially at the cost of some recall. Tuning nprobe allows you to trade speed for accuracy.
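To make this trade-off concrete, a short sweep over nprobe can report latency alongside the overlap with the exact flat-index results (a rough recall proxy). This is just a sketch reusing the objects built above:
# --- Sketch: nprobe sweep (latency vs. overlap with exact results) ---
for nprobe in [1, 5, 10, 20, 50]:
    index_ivfpq.nprobe = nprobe
    t0 = time.time()
    _, I_sweep = index_ivfpq.search(query_vectors, k)
    elapsed = time.time() - t0
    # Average overlap with the flat-index results across all queries (recall proxy)
    overlap = np.mean([len(np.intersect1d(I_flat[q], I_sweep[q])) / k for q in range(n_query)])
    print(f"nprobe={nprobe:3d}  time={elapsed:.4f}s  overlap@{k}={overlap:.2f}")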
Now, let's incorporate our metadata filters: category == 'AI' and year >= 2022.
The simplest approach is post-filtering: we perform the ANN search first (potentially retrieving more than k results to compensate for filtering losses) and then filter the results based on metadata.
# --- Optimization 2a: IVFPQ + Post-filtering ---
print("\n--- Optimization 2a: IVFPQ + Post-filtering ---")
k_oversample = k * 5 # Retrieve more results initially to account for filtering
index_ivfpq.nprobe = 10 # Reset nprobe if needed
start_time_search = time.time()
D_oversample, I_oversample = index_ivfpq.search(query_vectors, k_oversample)
end_time_search = time.time()
# --- Filtering Step ---
start_time_filter = time.time()
final_results_post = []
final_distances_post = []
for i in range(n_query):  # For each query
    filtered_indices = []
    filtered_distances = []
    for j in range(k_oversample):
        doc_id = I_oversample[i, j]
        if doc_id < 0: continue  # Faiss might return -1 for invalid results
        meta = metadata_map.get(doc_id)
        if meta and meta['category'] == target_category and meta['year'] >= target_min_year:
            filtered_indices.append(doc_id)
            filtered_distances.append(D_oversample[i, j])
        if len(filtered_indices) == k:  # Stop once we have enough results
            break
    final_results_post.append(filtered_indices)
    final_distances_post.append(filtered_distances)
end_time_filter = time.time()
total_time = (end_time_search - start_time_search) + (end_time_filter - start_time_filter)
print(f"IVFPQ search time: {end_time_search - start_time_search:.4f} seconds")
print(f"Post-filtering time: {end_time_filter - start_time_filter:.4f} seconds")
print(f"Total time (search + post-filter): {total_time:.4f} seconds")
print(f"Sample filtered results (indices): {final_results_post[0]}")
Post-filtering is straightforward but performs potentially unnecessary work by fetching and checking metadata for candidates that might have been excluded earlier. The overhead increases with k_oversample.
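One way to size k_oversample is from the filter's estimated selectivity. The synthetic metadata here is uniform, so the estimate is straightforward (a real system would use corpus statistics); note it suggests that the k * 5 = 25 candidates used above may be slightly too few for this particular filter:
# --- Sketch: choosing k_oversample from estimated filter selectivity ---
frac_category = 1 / len(categories)                                 # 1 of 4 categories matches
frac_year = sum(y >= target_min_year for y in years) / len(years)   # 2 of 4 years match
est_selectivity = frac_category * frac_year                         # ~0.125
k_oversample_suggested = int(np.ceil(k / est_selectivity))          # ~40 for k=5
print(f"Estimated selectivity: {est_selectivity:.3f}")
print(f"Suggested k_oversample: {k_oversample_suggested}")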
Efficient pre-filtering often requires integration with the indexing structure itself or the underlying database. Vector search libraries like Faiss offer mechanisms like IDSelector that allow searching only a subset of specified IDs. We can simulate the effect of pre-filtering by first identifying the document IDs that match our metadata criteria and then instructing the search library (if possible) to only consider these IDs.
# --- Optimization 2b: IVFPQ + Pre-filtering (Simulated with IDSelector) ---
print("\n--- Optimization 2b: IVFPQ + Pre-filtering (Simulated) ---")
start_time_id_select = time.time()
# Step 1: Identify IDs matching the metadata filter
eligible_ids_list = [
    item['id'] for item in metadata
    if item['category'] == target_category and item['year'] >= target_min_year
]
eligible_ids = np.array(eligible_ids_list, dtype='int64') # Faiss often requires int64 IDs
end_time_id_select = time.time()
# If the number of eligible IDs is very small, exact search on just that subset might be faster
if len(eligible_ids) == 0:
    print("No documents match the filter criteria.")
    # Handle the empty result case (skip the vector search entirely)
else:
    if len(eligible_ids) < 1000:  # Threshold depends on system/data
        print(f"Found {len(eligible_ids)} matching IDs. Potentially use exact search on subset.")
        # Implementation for exact search on the subset omitted for brevity
    else:
        print(f"Found {len(eligible_ids)} matching IDs. Using IDSelector with IVFPQ.")
    # Step 2: Use a Faiss IDSelector to restrict the search to the eligible IDs
    id_selector = faiss.IDSelectorArray(eligible_ids)
    # Step 3: Perform search with the selector
    # Note: The selector is passed to `search` via a SearchParameters object
    # Faiss syntax: index.search(query, k, params=SearchParameters) where params includes the selector
    # Simplified call structure for illustration (actual API might differ slightly between versions)
    search_params = faiss.SearchParametersIVF(sel=id_selector, nprobe=index_ivfpq.nprobe)
    start_time_search = time.time()
    D_pre, I_pre = index_ivfpq.search(query_vectors, k, params=search_params)
    end_time_search = time.time()
    total_time = (end_time_id_select - start_time_id_select) + (end_time_search - start_time_search)
    print(f"ID selection time: {end_time_id_select - start_time_id_select:.4f} seconds")
    print(f"IVFPQ search time (with pre-filter): {end_time_search - start_time_search:.4f} seconds")
    print(f"Total time (ID select + search): {total_time:.4f} seconds")
    print(f"Sample filtered results (indices): {I_pre[0]}")
# Caveat: The performance benefit of pre-filtering highly depends on the selectivity
# of the filter and the efficiency of the IDSelector implementation within the index.
# If the filter selects a large fraction of the dataset, the benefit might be small.
Pre-filtering, when implemented efficiently within the index, can significantly reduce search time by avoiding computations for irrelevant vectors. However, it requires the index structure or database to support this capability effectively. The initial cost of identifying eligible IDs also needs consideration.
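One way to keep the ID-selection cost low is to precompute simple metadata postings (lists of IDs per category and per year) at ingestion time, rather than scanning all metadata on every query. A minimal sketch, assuming the metadata is static; a production system would maintain these maps incrementally:
# --- Sketch: precomputed metadata postings for cheap ID selection ---
from collections import defaultdict

ids_by_category = defaultdict(list)
ids_by_year = defaultdict(list)
for item in metadata:
    ids_by_category[item['category']].append(item['id'])
    ids_by_year[item['year']].append(item['id'])

# At query time, intersect the relevant postings instead of scanning all metadata
year_ids = np.concatenate([np.array(ids_by_year[y], dtype='int64') for y in years if y >= target_min_year])
candidate_ids = np.intersect1d(np.array(ids_by_category[target_category], dtype='int64'), year_ids)
print(f"Candidates via postings: {len(candidate_ids)}")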
Let's visualize the trade-offs we observed. We'll plot approximate latency against recall (or a proxy for it) for the different approaches.
(Chart: Illustrative comparison of different optimization strategies. Values are examples and actual performance depends heavily on data, hardware, and parameter tuning. Pre-filtering assumes efficient implementation.)
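The chart itself isn't reproduced here, but if you want to plot your own measurements, a minimal sketch follows. It assumes matplotlib is available; the numbers below are placeholders to replace with the latencies and recall proxies from your runs:
# --- Sketch: latency vs. recall plot from your own measurements ---
import matplotlib.pyplot as plt

strategies = ['Flat (exact)', 'IVFPQ', 'IVFPQ + post-filter', 'IVFPQ + pre-filter']
latency_s = [0.050, 0.005, 0.008, 0.006]  # placeholder values, replace with measurements
recall = [1.00, 0.90, 0.90, 0.90]         # placeholder values, replace with measurements

plt.figure(figsize=(6, 4))
plt.scatter(recall, latency_s)
for name, x, y in zip(strategies, recall, latency_s):
    plt.annotate(name, (x, y), textcoords='offset points', xytext=(5, 5))
plt.xlabel('Recall@k (approximate)')
plt.ylabel('Latency (seconds)')
plt.title('Latency vs. recall (illustrative)')
plt.tight_layout()
plt.show()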
This chart highlights the typical behavior:
- Increasing nprobe improves recall but also increases latency.
- Post-filtering adds overhead on top of the ANN search, and that overhead grows with k_oversample.

This practice session demonstrated how quantization (specifically IVFPQ) drastically reduces search latency compared to exhaustive search, albeit with an accuracy trade-off controlled by parameters like nprobe. We also implemented and contrasted post-filtering and (simulated) pre-filtering. Post-filtering is simple but potentially inefficient, while pre-filtering, when supported effectively, can prune the search space early, leading to better performance, especially with selective filters.
Choosing the right combination involves understanding your application's specific requirements for speed, memory usage, and accuracy, and carefully tuning parameters like nprobe, M, nbits, and k_oversample based on empirical evaluation. The techniques covered here are foundational for building responsive and scalable LLM applications relying on vector search.