Okay, let's transition from the theoretical understanding of hybrid search to a practical implementation. In this section, we'll construct a basic hybrid search pipeline using Python, combining keyword search (specifically BM25) with vector similarity search. This hands-on exercise demonstrates how to integrate these distinct retrieval methods and fuse their results for improved relevance.
We assume a working Python environment. You'll need libraries for vector manipulation and keyword search, and optionally a vector database client or ANN library such as Faiss.
Before we begin, ensure you have the necessary libraries installed. For this example, we'll use `rank_bm25` for keyword search, `sentence-transformers` for embedding generation (though we'll assume vectors are pre-computed for the search phase), and `numpy`.
```bash
pip install rank_bm25 numpy sentence-transformers
```
We'll work with a hypothetical dataset of documents, where each document has an ID, text content, and a pre-computed vector embedding.
```python
# Sample data structure (replace with your actual data loading)
documents = [
    {"id": "doc1", "text": "Introduction to machine learning concepts.", "vector": [0.1, 0.2, ..., 0.5]},
    {"id": "doc2", "text": "Advanced deep learning techniques for NLP.", "vector": [0.3, 0.1, ..., 0.7]},
    {"id": "doc3", "text": "A guide to Python programming basics.", "vector": [0.8, 0.7, ..., 0.1]},
    {"id": "doc4", "text": "Optimizing machine learning models.", "vector": [0.2, 0.3, ..., 0.6]},
    # ... more documents
]

# Simulate loading pre-computed vectors into a structure for lookup
vector_index_data = {doc["id"]: doc["vector"] for doc in documents}
document_texts = {doc["id"]: doc["text"] for doc in documents}

# Example query
query_text = "machine learning optimization"

# Generate the query embedding. In a real scenario, use the same model
# that produced the document vectors.
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('all-MiniLM-L6-v2')  # Example model

def get_query_embedding(query):
    return embedding_model.encode(query)

query_vector = get_query_embedding(query_text)
```
First, let's set up the BM25 component. We'll index the text content of our documents. The `rank_bm25` library provides a straightforward implementation.
```python
import numpy as np
from rank_bm25 import BM25Okapi

# Tokenize the documents (simple whitespace splitting for this example)
tokenized_corpus = [doc["text"].lower().split() for doc in documents]
doc_ids = [doc["id"] for doc in documents]

# Create the BM25 index
bm25 = BM25Okapi(tokenized_corpus)

def perform_keyword_search(query_text, top_n=10):
    """Performs BM25 search and returns ranked document IDs and scores."""
    tokenized_query = query_text.lower().split()
    # Get scores for all documents
    doc_scores = bm25.get_scores(tokenized_query)
    # Get the indices of the top N scores
    top_indices = np.argsort(doc_scores)[::-1][:top_n]
    # Map indices back to document IDs and build the results list
    results = []
    for index in top_indices:
        if doc_scores[index] > 0:  # Only include docs with a non-zero score
            results.append({"id": doc_ids[index], "score": doc_scores[index]})
    return results

# Perform keyword search
keyword_results = perform_keyword_search(query_text, top_n=5)
print("Keyword Search Results (BM25):")
print(keyword_results)
# Expected Output (illustrative; exact scores depend heavily on the corpus):
# Keyword Search Results (BM25):
# [{'id': 'doc4', 'score': 0.937...}, {'id': 'doc1', 'score': 0.937...}]
# Note: documents with a zero BM25 score are filtered out by the function.
```
This gives us a list of document IDs ranked by their BM25 relevance score to the query.
Next, we simulate the vector search component. In a real system, this would involve querying an ANN index (like HNSW in Faiss, OpenSearch, Pinecone, Weaviate, etc.) built from the `vector_index_data`. Here, we'll simulate it by calculating cosine similarity against all vectors (inefficient for large datasets, but sufficient for demonstrating the hybrid concept).
```python
from numpy.linalg import norm

def cosine_similarity(vec_a, vec_b):
    """Computes cosine similarity between two vectors."""
    return np.dot(vec_a, vec_b) / (norm(vec_a) * norm(vec_b))

def perform_vector_search(query_vector, index_data, top_n=10):
    """Performs vector search and returns ranked document IDs and scores."""
    scores = []
    for doc_id, doc_vector in index_data.items():
        similarity = cosine_similarity(query_vector, doc_vector)
        scores.append({"id": doc_id, "score": similarity})
    # Sort by score descending
    scores.sort(key=lambda x: x["score"], reverse=True)
    return scores[:top_n]

# Perform vector search
vector_results = perform_vector_search(query_vector, vector_index_data, top_n=5)
print("\nVector Search Results:")
print(vector_results)

# Expected Output (example):
# Vector Search Results:
# [{'id': 'doc4', 'score': 0.85...}, {'id': 'doc1', 'score': 0.78...}, {'id': 'doc2', 'score': 0.65...}, {'id': 'doc3', 'score': 0.12...}]
```
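In production, the brute-force loop above would be replaced by a query against an ANN index. Below is a minimal sketch using Faiss, assuming the `faiss-cpu` package is installed and that your vectors are real floats rather than the elided placeholders above; the helper name `perform_vector_search_ann` and the HNSW parameter `M=32` are illustrative choices, not requirements.

```python
# A minimal sketch of ANN-backed vector search with Faiss (assumed installed).
import faiss  # pip install faiss-cpu
import numpy as np

doc_ids_list = list(vector_index_data.keys())
doc_matrix = np.array([vector_index_data[d] for d in doc_ids_list], dtype="float32")

# Normalize vectors in-place so inner product equals cosine similarity
faiss.normalize_L2(doc_matrix)

# HNSW index with 32 neighbors per node; M=32 is a tunable, illustrative choice
index = faiss.IndexHNSWFlat(doc_matrix.shape[1], 32, faiss.METRIC_INNER_PRODUCT)
index.add(doc_matrix)

def perform_vector_search_ann(query_vector, top_n=10):
    """Approximate drop-in replacement for perform_vector_search."""
    query = np.array([query_vector], dtype="float32")
    faiss.normalize_L2(query)
    scores, indices = index.search(query, top_n)
    # Faiss returns -1 for unfilled result slots; skip those
    return [{"id": doc_ids_list[i], "score": float(s)}
            for i, s in zip(indices[0], scores[0]) if i != -1]
```

The interface mirrors `perform_vector_search`, so the rest of the pipeline works unchanged if you swap it in.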
Now we have a second list of document IDs, ranked by semantic similarity. Notice that the scores live on different scales: cosine similarity falls in [-1, 1] (or [0, 1] for non-negative embeddings), while BM25 scores are non-negative and unbounded. The ranking may also differ significantly from the keyword results.
As discussed previously, simply adding or averaging scores is problematic due to different scales and distributions. Reciprocal Rank Fusion (RRF) offers a robust way to combine rankings without needing score normalization.
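To make the contrast concrete, here is a minimal sketch of the score-based alternative: min-max normalize each list, then take a weighted sum. The weight `alpha` and both helper names are hypothetical; this only illustrates the tuning burden that RRF avoids.

```python
# A hedged sketch of score-based fusion via min-max normalization.
# The weight `alpha` is a hypothetical parameter you would have to tune.
def normalize_scores(results):
    """Min-max normalizes the 'score' field of a results list into [0, 1]."""
    if not results:
        return {}
    values = [r["score"] for r in results]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {r["id"]: (r["score"] - lo) / span for r in results}

def weighted_sum_fusion(keyword_results, vector_results, alpha=0.5):
    """Combines normalized keyword and vector scores with weight alpha."""
    kw = normalize_scores(keyword_results)
    vec = normalize_scores(vector_results)
    fused = {doc_id: alpha * kw.get(doc_id, 0.0) + (1 - alpha) * vec.get(doc_id, 0.0)
             for doc_id in set(kw) | set(vec)}
    return sorted(({"id": d, "score": s} for d, s in fused.items()),
                  key=lambda x: x["score"], reverse=True)
```

Even this simple baseline requires choosing `alpha` and trusting that min-max scaling makes two very different score distributions comparable. RRF replaces all of that with a purely rank-based computation.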
The RRF score for a document $d$ is calculated as:

$$\mathrm{Score}_{\mathrm{RRF}}(d) = \sum_{i \in \text{Result Lists}} \frac{1}{k + \mathrm{rank}_i(d)}$$

where $\mathrm{rank}_i(d)$ is the rank of document $d$ in result list $i$ (starting from 1), and $k$ is a constant controlling the influence of lower-ranked items. A common value for $k$ is 60. For example, a document ranked 1st in one list and 3rd in the other receives $\tfrac{1}{60+1} + \tfrac{1}{60+3} \approx 0.0323$.
Let's implement the RRF function:
```python
def apply_rrf(results_lists, k=60):
    """Applies Reciprocal Rank Fusion to combine multiple ranked lists."""
    fused_scores = {}
    for results in results_lists:
        for rank, result in enumerate(results):
            doc_id = result["id"]
            if doc_id not in fused_scores:
                fused_scores[doc_id] = 0.0
            # Add this list's RRF contribution (enumerate is 0-based, so add 1)
            fused_scores[doc_id] += 1.0 / (k + rank + 1)
    # Combine IDs and scores, then sort descending
    fused_results = [{"id": doc_id, "score": score} for doc_id, score in fused_scores.items()]
    fused_results.sort(key=lambda x: x["score"], reverse=True)
    return fused_results

# Combine the keyword and vector results
hybrid_results = apply_rrf([keyword_results, vector_results], k=60)
print("\nHybrid Search Results (RRF):")
print(hybrid_results)

# Expected Output (example, depends on input ranks):
# Hybrid Search Results (RRF):
# [{'id': 'doc4', 'score': 0.032...}, {'id': 'doc1', 'score': 0.032...}, {'id': 'doc2', 'score': 0.016...}, {'id': 'doc3', 'score': 0.015...}]
# Note: RRF scores are relative; the ranking is the primary output.
# Here, doc4 and doc1 appeared high in both lists, earning higher RRF scores.
```
We can encapsulate the process into a single function representing our hybrid search pipeline.
*Diagram: the hybrid search pipeline flow. The user query drives both the keyword search and the embedding generation for vector search; results from both are then fused (e.g., with RRF) to produce the final ranked list.*
```python
def hybrid_search_pipeline(query_text, top_n=10, rrf_k=60):
    """Executes the full hybrid search pipeline."""
    print(f"Performing hybrid search for: '{query_text}'")

    # 1. Get the query embedding
    query_vector = get_query_embedding(query_text)

    # 2. Perform keyword search (retrieve extra candidates for fusion)
    keyword_results = perform_keyword_search(query_text, top_n=top_n * 2)
    print(f"  BM25 returned {len(keyword_results)} results.")

    # 3. Perform vector search
    # Replace this with a call to your actual vector search system
    vector_results = perform_vector_search(query_vector, vector_index_data, top_n=top_n * 2)
    print(f"  Vector Search returned {len(vector_results)} results.")

    # 4. Apply RRF fusion
    fused_results = apply_rrf([keyword_results, vector_results], k=rrf_k)
    print(f"  Fusion produced {len(fused_results)} candidates.")

    # 5. Return the final top N results with their text
    final_results = []
    for result in fused_results[:top_n]:
        doc_id = result["id"]
        final_results.append({
            "id": doc_id,
            "text": document_texts.get(doc_id, "N/A"),  # Fetch text for context
            "rrf_score": result["score"]
        })
    return final_results

# Example Usage
final_ranked_list = hybrid_search_pipeline(query_text, top_n=5)
print("\nFinal Hybrid Ranked List:")
for i, item in enumerate(final_ranked_list):
    print(f"{i+1}. ID: {item['id']}, Score: {item['rrf_score']:.4f}, Text: {item['text'][:80]}...")
```
A few implementation notes on the pipeline:

- `k`: The RRF constant `k` influences how much weight is given to lower-ranked items; a smaller `k` gives more importance to top ranks. Its optimal value can depend on the dataset and query patterns and can be tuned via offline evaluation (Chapter 5). A weighted variant of RRF is sketched after this list.
- `top_n * 2`: The pipeline retrieves `top_n * 2` results from each system before fusion. Retrieving more candidates (e.g., the top 50 or 100) from each source generally leads to better final results, at the cost of slightly higher fusion computation.
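If one retriever should count more than the other, a common extension is to weight each list's reciprocal-rank contribution. The function `apply_weighted_rrf` and its `weights` argument below are illustrative additions, not part of standard RRF.

```python
# A sketch of weighted RRF: each result list gets a multiplier on its
# reciprocal-rank contribution. The `weights` argument is a hypothetical
# extension for when, say, vector results should count more than keyword ones.
def apply_weighted_rrf(results_lists, weights=None, k=60):
    """Like apply_rrf, but scales each list's contribution by a weight."""
    if weights is None:
        weights = [1.0] * len(results_lists)
    fused_scores = {}
    for weight, results in zip(weights, results_lists):
        for rank, result in enumerate(results):
            doc_id = result["id"]
            fused_scores[doc_id] = fused_scores.get(doc_id, 0.0) + weight / (k + rank + 1)
    return sorted(({"id": d, "score": s} for d, s in fused_scores.items()),
                  key=lambda x: x["score"], reverse=True)

# Example: favor vector results 2:1 over keyword results
# weighted_results = apply_weighted_rrf([keyword_results, vector_results], weights=[1.0, 2.0])
```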
This hands-on exercise provides a foundational blueprint for building hybrid search systems. By combining the strengths of lexical and semantic retrieval, you can often achieve significantly better relevance than either approach alone, particularly for the complex information needs common in LLM applications such as RAG.