As discussed earlier, while vector search excels at capturing semantic meaning and finding conceptually similar items, it can sometimes miss documents where specific keywords, identifiers, or exact phrases are important. For instance, searching for a product code like XG-500 or a specific function name like calculate_fft might not reliably surface the most relevant documents if the semantic representation doesn't strongly emphasize that exact term compared to its broader context. This is where traditional keyword search techniques retain significant value.
Integrating keyword search, particularly using robust algorithms like Okapi BM25, alongside vector search allows you to create a hybrid system that leverages the strengths of both approaches: semantic understanding from vectors and precise term matching from keywords.
You might be familiar with TF-IDF (Term Frequency Inverse Document Frequency), a classic technique for scoring documents based on keyword relevance. It calculates a weight for each term in a document based on how often the term appears in that document (TF) and how rare the term is across the entire collection of documents (IDF). The basic idea is that terms frequent in a specific document but rare overall are good indicators of that document's content.
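As a quick illustration of that idea, here is a minimal TF-IDF weighting sketch in Python; the function name and the simple smoothed IDF variant are illustrative choices, not taken from any particular library:

```python
import math
from collections import Counter

def tf_idf_weights(doc_tokens, corpus_tokens):
    """Weight each term of one document by TF * IDF against a small corpus."""
    n_docs = len(corpus_tokens)
    term_counts = Counter(doc_tokens)          # TF: raw counts within this document
    weights = {}
    for term, tf in term_counts.items():
        # DF: number of documents in the collection containing the term.
        df = sum(1 for doc in corpus_tokens if term in doc)
        # IDF: rare terms across the corpus get a higher weight; the +1 terms
        # are a common smoothing choice that keeps the log non-negative.
        idf = math.log((1 + n_docs) / (1 + df))
        weights[term] = tf * idf
    return weights
```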
While foundational, TF-IDF has limitations. It doesn't account for term frequency saturation (the idea that the 10th occurrence of a word adds less relevance than the 1st) and lacks a sophisticated way to handle variations in document length.
BM25 (Best Matching 25) is a more advanced probabilistic retrieval function that directly addresses these limitations and generally provides superior results for keyword search. It has become the standard in many modern search engines. BM25 refines the TF and IDF concepts and introduces document length normalization.
The scoring formula for a document $D$ given a query $Q$ (containing terms $q_1, \dots, q_n$) in BM25 is typically represented as:

$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

Let's break down the components:
Inverse Document Frequency (IDF): Similar to TF-IDF, this component measures the rarity or informativeness of a query term $q_i$. Terms appearing in fewer documents receive a higher IDF score. There are various ways to calculate IDF; a common one is:

$$\text{IDF}(q_i) = \log\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)$$

where $N$ is the total number of documents in the collection, and $n(q_i)$ is the number of documents containing the term $q_i$. The $+0.5$ terms are added for smoothing, preventing division by zero for terms not in the corpus and avoiding issues with terms present in more than half the documents.
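As a small sketch, this smoothed IDF translates almost directly into code (the function name here is illustrative):

```python
import math

def bm25_idf(term, corpus_tokens):
    """Smoothed BM25 IDF for a single term over a tokenized corpus."""
    N = len(corpus_tokens)                                  # total documents
    n_q = sum(1 for doc in corpus_tokens if term in doc)    # documents containing the term
    # The +0.5 terms smooth the ratio, and the +1 inside the log keeps the
    # value positive even for terms that appear in most documents.
    return math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)
```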
Term Frequency Saturation: The fraction $\frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (\dots)}$ modulates the raw term frequency $f(q_i, D)$. The parameter $k_1$ (typically between 1.2 and 2.0) controls how quickly the score saturates. As a term appears more frequently in a document, its contribution to the score increases, but the rate of increase diminishes. This prevents documents that simply repeat a term many times from dominating the results unfairly.
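The short snippet below makes the saturation behavior concrete; the $k_1$ value and term counts are arbitrary illustrative numbers, and the document is assumed to have average length so the normalization factor is 1:

```python
k1 = 1.5
for tf in [1, 2, 5, 10, 50]:
    # With |D| = avgdl the length term equals 1, so the term-frequency factor
    # reduces to tf * (k1 + 1) / (tf + k1), which approaches k1 + 1 as tf grows.
    factor = tf * (k1 + 1) / (tf + k1)
    print(f"tf={tf:>2}  factor={factor:.2f}")
```

The factor climbs from 1.0 toward an asymptote of $k_1 + 1 = 2.5$, so repeating a term beyond a handful of occurrences adds very little additional score.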
Document Length Normalization: The term $\left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)$ normalizes the score based on the document length $|D|$ relative to the average document length $\text{avgdl}$ in the collection. The parameter $b$ (typically around 0.75) controls the degree of normalization. When $b = 1$, the normalization is fully applied; when $b = 0$, there is no length normalization. This helps prevent very long documents from having an unfair advantage simply because they have more opportunities to contain query terms.
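A similar quick check (with illustrative length ratios) shows how $b$ scales the penalty for documents longer than average:

```python
length_ratios = [0.5, 1.0, 2.0, 4.0]   # |D| / avgdl
for b in [0.0, 0.75, 1.0]:
    # This factor multiplies k1 in the denominator of the BM25 fraction, so values
    # above 1 (longer documents) lower the score and values below 1 raise it.
    factors = [1 - b + b * r for r in length_ratios]
    print(f"b={b:<4} -> " + ", ".join(f"{f:.2f}" for f in factors))
```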
By tuning $k_1$ and $b$, you can adapt BM25's behavior to your specific dataset and query patterns.
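Putting the pieces together, here is a minimal from-scratch BM25 scorer that mirrors the formula above, with k1 and b exposed for tuning. It is a readable sketch rather than an optimized implementation; a production index would precompute document frequencies and the average document length instead of rescanning the corpus for every query:

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score one tokenized document against a tokenized query using BM25."""
    N = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / N
    doc_len = len(doc_tokens)
    term_counts = Counter(doc_tokens)

    score = 0.0
    for term in query_tokens:
        f_qd = term_counts.get(term, 0)
        if f_qd == 0:
            continue  # a query term absent from the document contributes nothing
        n_q = sum(1 for doc in corpus_tokens if term in doc)
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)
        numerator = f_qd * (k1 + 1)
        denominator = f_qd + k1 * (1 - b + b * doc_len / avgdl)
        score += idf * numerator / denominator
    return score
```

Libraries such as rank_bm25, or the BM25-based scoring built into engines like Elasticsearch and OpenSearch, implement the same formula with those optimizations and expose $k_1$ and $b$ as configuration parameters.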
Integrating BM25 (or another keyword algorithm) with vector search usually involves running both search types in parallel and then combining the results.
Parallel Query Execution: When a user query arrives, it is sent simultaneously to both retrieval paths: the query is embedded and matched against the vector index, while its terms are scored against the keyword index using BM25.
Indexing Requirements: This approach requires maintaining two distinct types of indices: a vector index holding dense document embeddings for similarity search, and an inverted (keyword) index holding the term statistics that BM25 needs.
System Architecture: You might use separate, specialized systems connected at the application layer, or leverage platforms designed for hybrid search that manage both dense and sparse (keyword) representations internally. The choice depends on your scale, performance requirements, and existing infrastructure.
Diagram: Flow of a hybrid search system executing vector and keyword search in parallel before combining results.
Executing these searches independently yields two distinct sets of candidate documents, each ranked according to its respective scoring mechanism (similarity score for vectors, BM25 score for keywords). The next critical step, which we'll cover in the following section, is how to effectively merge or fuse these separate result lists into a single, coherent ranking that reflects both semantic relevance and keyword importance.
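As a structural sketch of that parallel stage, the outline below runs both retrievers concurrently and returns the two ranked candidate lists. The vector_index.search, keyword_index.search, and embed interfaces are assumptions standing in for whatever vector store and BM25 index you actually use:

```python
from concurrent.futures import ThreadPoolExecutor

def hybrid_candidates(query, vector_index, keyword_index, embed, k=20):
    """Run vector and keyword retrieval in parallel; return both ranked lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        vector_future = pool.submit(vector_index.search, embed(query), k)
        keyword_future = pool.submit(keyword_index.search, query.split(), k)

    vector_hits = vector_future.result()    # ranked by vector similarity score
    keyword_hits = keyword_future.result()  # ranked by BM25 score
    # Fusing these two lists into a single ranking is covered in the next section.
    return vector_hits, keyword_hits
```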