Okay, let's transition from understanding why we need hybrid search to how we actually combine the results from disparate systems like vector search and traditional keyword search (e.g., BM25). Running these searches in parallel gives us two lists of potential results, each with its own set of documents and associated scores. The challenge is to merge these into a single, coherently ranked list that reflects relevance based on both semantic similarity and term matching.
The Challenge of Incomparable Scores
A primary obstacle in fusing results is that the scores produced by different retrieval methods are often not directly comparable.
- Vector Search Scores: These typically represent similarity or distance. Cosine similarity scores usually fall within the range [-1, 1] or [0, 1], while Euclidean distances are non-negative ([0,∞)). These scores measure semantic proximity in the embedding space.
- Keyword Search Scores: Algorithms like BM25 produce scores that reflect term frequency, inverse document frequency, and document length. These scores are typically non-negative ([0,∞)) but don't have a fixed upper bound and their distribution can vary significantly depending on the query terms and the corpus statistics.
Attempting to directly add or compare a cosine similarity of 0.85 to a BM25 score of 15.2 is meaningless. They represent different concepts on different scales. Furthermore, the relationship between score and relevance might be non-linear and differ between the two systems. A small increase in cosine similarity might represent a significant relevance gain, while a similar absolute increase in a high BM25 score might be less impactful.
Score distributions for top 10 documents from vector (Cosine Similarity, bottom axis) and keyword (BM25, top axis) search. Higher scores are generally better, but the scales and distributions differ significantly.
Therefore, effective fusion strategies must first address this score incompatibility.
Score Normalization Techniques
One common approach is to normalize the scores from each retrieval system to a common scale, typically [0, 1], before combining them.
- Min-Max Scaling: This technique linearly scales scores to the [0, 1] range based on the minimum and maximum scores observed within the current result list for each system.

  $$\text{Score}_{\text{norm}} = \frac{\text{Score} - \text{Score}_{\text{min}}}{\text{Score}_{\text{max}} - \text{Score}_{\text{min}}}$$

  - Pros: Simple to implement, and guarantees scores are bounded between 0 and 1.
  - Cons: Highly sensitive to outliers (a single very high or low score can compress the range for all other scores), and assumes a linear relationship between score and relevance. Requires fetching enough results to establish a reasonable Score_min and Score_max.
- Z-Score Normalization (Standard Score): This method scales scores based on the mean (μ) and standard deviation (σ) of the scores within the result list.

  $$\text{Score}_{\text{norm}} = \frac{\text{Score} - \mu}{\sigma}$$

  - Pros: Less sensitive to extreme outliers compared to Min-Max, and accounts for the distribution of scores.
  - Cons: Produces scores that are not bounded (typically between -3 and +3, but they can exceed this range), and requires calculating the mean and standard deviation. The resulting scores might still not be directly comparable if the underlying distributions are very different (e.g., highly skewed), and may require a further transformation (such as applying a sigmoid function) to map them to [0, 1].
Calculating these statistics (Score_min, Score_max, μ, σ) is usually done per query, based on the retrieved set of results (e.g., the top 100 documents from each system).
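A minimal sketch of both techniques, assuming each result list arrives as a dictionary mapping document IDs to raw scores (the function names and example scores are illustrative, not taken from any particular library):

```python
import statistics

def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Linearly rescale raw scores to [0, 1] using the min and max of this result list."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:                       # all scores identical: avoid division by zero
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def z_score_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Standardize raw scores using the mean and standard deviation of this result list."""
    values = list(scores.values())
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # std dev of the retrieved set for this query
    if sigma == 0:                     # all scores identical
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - mu) / sigma for doc_id, s in scores.items()}

# Per-query usage: normalize each system's top-k list independently.
vector_scores = {"doc1": 0.85, "doc2": 0.80, "doc3": 0.62}   # cosine similarities
keyword_scores = {"doc2": 15.2, "doc4": 9.1, "doc1": 7.4}    # BM25 scores
vec_norm = min_max_normalize(vector_scores)
kw_norm = min_max_normalize(keyword_scores)
```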
Weighted Sum Combination
Once scores are normalized (e.g., using Min-Max to get Score_vec_norm and Score_kw_norm), a straightforward way to combine them is through a weighted sum. For a document d retrieved by either or both systems, its hybrid score can be calculated as:

$$\text{Score}_{\text{hybrid}}(d) = w_{\text{vector}} \times \text{Score}_{\text{vec\_norm}}(d) + w_{\text{keyword}} \times \text{Score}_{\text{kw\_norm}}(d)$$
Here:
- Score_vec_norm(d) is the normalized vector score for document d. If d was not found by vector search, this score is 0.
- Score_kw_norm(d) is the normalized keyword score for document d. If d was not found by keyword search, this score is 0.
- w_vector and w_keyword are weights determining the relative importance of each search system. Typically, w_vector + w_keyword = 1.
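Building on the normalized dictionaries from the previous sketch, a minimal implementation of the weighted sum might look like the following (the default weights are illustrative placeholders, not recommended values):

```python
def weighted_sum_fusion(vec_norm: dict[str, float],
                        kw_norm: dict[str, float],
                        w_vector: float = 0.6,
                        w_keyword: float = 0.4) -> list[tuple[str, float]]:
    """Combine normalized scores; a document missing from one list contributes 0 for that system."""
    all_docs = set(vec_norm) | set(kw_norm)
    hybrid = {
        doc_id: w_vector * vec_norm.get(doc_id, 0.0) + w_keyword * kw_norm.get(doc_id, 0.0)
        for doc_id in all_docs
    }
    # Highest hybrid score first.
    return sorted(hybrid.items(), key=lambda item: item[1], reverse=True)

ranked = weighted_sum_fusion(vec_norm, kw_norm)   # e.g., [('doc2', 0.87), ('doc1', 0.70), ...]
```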
Determining Weights: The weights w_vector and w_keyword are hyperparameters that significantly influence the final ranking. Their optimal values depend heavily on the specific use case, data characteristics, and user expectations.
- For applications where semantic meaning is most important, w_vector might be higher.
- For applications where matching specific terms or codes is critical, w_keyword might be higher.
- These weights are often tuned empirically using techniques like:
  - Offline Evaluation: Optimize weights based on metrics like NDCG (Normalized Discounted Cumulative Gain) or MAP (Mean Average Precision) calculated on a labeled dataset (ground truth).
  - A/B Testing: Deploy different weight combinations to different user segments and measure online metrics like click-through rate or task success rate.
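As a sketch of the offline approach, the snippet below grid-searches w_vector (with w_keyword = 1 − w_vector) against labeled relevance judgments using a hand-rolled NDCG@10. It reuses the `min_max_normalize` and `weighted_sum_fusion` functions from the earlier sketches; `run_vector_search` and `run_keyword_search` are assumed helpers standing in for your actual retrieval calls:

```python
import math

def ndcg_at_k(ranked_doc_ids: list[str], relevance: dict[str, int], k: int = 10) -> float:
    """NDCG@k for a ranked list against graded relevance judgments (unjudged docs count as 0)."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if any(ideal) else 0.0

def tune_weights(queries: list[str],
                 relevance_by_query: dict[str, dict[str, int]],
                 step: float = 0.1) -> float:
    """Grid-search w_vector (w_keyword = 1 - w_vector) by mean NDCG@10 over labeled queries."""
    best_w, best_ndcg = 0.0, -1.0
    for i in range(int(round(1 / step)) + 1):
        w_vec = round(i * step, 2)
        ndcgs = []
        for query in queries:
            vec_norm = min_max_normalize(run_vector_search(query))   # assumed retrieval helper
            kw_norm = min_max_normalize(run_keyword_search(query))   # assumed retrieval helper
            ranked = weighted_sum_fusion(vec_norm, kw_norm, w_vec, 1 - w_vec)
            ndcgs.append(ndcg_at_k([doc_id for doc_id, _ in ranked], relevance_by_query[query]))
        mean_ndcg = sum(ndcgs) / len(ndcgs)
        if mean_ndcg > best_ndcg:
            best_w, best_ndcg = w_vec, mean_ndcg
    return best_w
```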
Rank-Based Fusion Approaches
An alternative to score normalization and weighted sums is to use the rank positions of documents in each list. This approach inherently avoids the issues of comparing scores from different scales. The intuition is that a document's position in a ranked list is a more stable signal of relevance than its raw score. Methods like Reciprocal Rank Fusion (RRF), which we will discuss in the next section, operate on this principle. They combine ranks rather than scores, often proving more resilient to score variations and requiring less tuning than weighted sum methods.
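As a brief preview of the idea (RRF itself is covered in the next section), here is a minimal sketch that scores documents purely by their positions in each ranked list. The reciprocal weighting and the constant k = 60 are common defaults, used here only for illustration:

```python
def rank_based_fusion(result_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Score each document by the sum of 1 / (k + rank) over every list it appears in."""
    fused: dict[str, float] = {}
    for ranked_docs in result_lists:
        for rank, doc_id in enumerate(ranked_docs, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# No score normalization needed: only each document's position in each list is used.
fused = rank_based_fusion([list(vector_scores), list(keyword_scores)])
```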
Re-ranking Strategies
Fusion techniques primarily focus on combining the initial result lists. However, you can further refine the ranking by adding a re-ranking step. After obtaining a combined list of, say, the top 100 candidates using a fusion method (like weighted sum or RRF), a more sophisticated, and often more computationally expensive, model can be applied to re-order just these candidates (a sketch follows the list below).
- Purpose: To leverage richer features or more complex models (that would be too slow to apply to the entire corpus) to improve the precision of the top results.
- Models: This could involve:
  - Cross-Encoders: Transformer models that jointly process the query and a candidate document to produce a highly accurate relevance score.
  - Learning-to-Rank (LTR) Models: Models (e.g., LambdaMART, Gradient Boosted Trees) trained on features extracted from the query-document pair (including original vector/keyword scores, document metadata, query features, etc.).
- Trade-off: Re-ranking can significantly improve relevance but adds latency to the search process, as the re-ranking model needs to evaluate each of the top N candidates. This is often acceptable if the improved quality of the top few results is highly valued.
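A sketch of such a re-ranking pass using a cross-encoder from the sentence-transformers library; the model name is just one commonly used checkpoint, and the `get_text` lookup for fetching document text is an assumed helper:

```python
from sentence_transformers import CrossEncoder

# Loaded once at startup; a cross-encoder jointly encodes each (query, document) pair.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example checkpoint

def rerank(query: str, candidates: list[tuple[str, float]], get_text,
           top_n: int = 100) -> list[tuple[str, float]]:
    """Re-order the top_n fused candidates by cross-encoder relevance; keep the tail as-is."""
    head = candidates[:top_n]
    pairs = [(query, get_text(doc_id)) for doc_id, _ in head]
    ce_scores = reranker.predict(pairs)            # one relevance score per (query, doc) pair
    reranked_head = sorted(
        ((doc_id, float(score)) for (doc_id, _), score in zip(head, ce_scores)),
        key=lambda item: item[1], reverse=True,
    )
    return reranked_head + candidates[top_n:]

# Usage: `ranked` comes from the fusion step; doc_texts is an assumed id -> text mapping.
final = rerank("how to rotate an api key", ranked, get_text=lambda doc_id: doc_texts[doc_id])
```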
Hybrid search workflow: Initial retrieval via vector and keyword search, followed by a fusion step (e.g., normalization and weighted sum) to create a candidate list, and an optional re-ranking step using a more powerful model for final ordering.
Implementation Considerations
- Handling Overlaps: When a document appears in both the vector and keyword result lists, you need a consistent way to handle it during fusion. With weighted sums, you simply add the weighted normalized scores. With rank-based methods (like RRF), the presence in multiple lists contributes naturally to the fusion formula.
- Performance: Score normalization and weighted sum calculations are generally very fast compared to the initial search operations. Rank-based fusion methods are also typically lightweight. The main performance consideration comes from adding a re-ranking step, which introduces additional latency proportional to the number of candidates being re-ranked and the complexity of the re-ranking model.
Choosing the right fusion and ranking strategy involves balancing relevance requirements, computational budget (latency, cost), and the effort required for tuning and maintenance. Simple weighted sums are easier to start with but may require careful tuning, while rank-based methods can be more robust, and re-ranking offers the potential for higher accuracy at the cost of increased complexity and latency.