Semantic search is a core component of retrieval systems. This approach employs text embeddings and vector similarity to find information based on the meaning and intent behind a query. Unlike traditional keyword-based search, which relies on exact term matches, semantic search allows you to find relevant documents even if they don't contain the exact words used in the search query.
The process is a direct application of the tools you've just learned. To find the most relevant documents for a given query, you embed every document in your collection, embed the query itself, compute the similarity between the query embedding and each document embedding, and then rank the documents by their scores.
While you could loop through each document embedding and calculate its similarity to the query embedding one by one, this approach is inefficient, especially with a large number of documents. A more performant method is to compute all similarities in a single, optimized operation.
The batch_similarity function is designed for this purpose. It takes a single query vector and a list of document vectors, then efficiently calculates the similarity between the query and every document.
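For comparison, the one-document-at-a-time approach described above would look roughly like the sketch below. It is shown only for contrast and uses cosine_similarity, which is introduced properly later in this section.
from kerb.embedding import embed, embed_batch, cosine_similarity

documents = ["Python is a programming language", "Cats make popular pets"]
doc_embeddings = embed_batch(documents)
query_embedding = embed("software development")

# One similarity call per document: simple, but slow for large collections.
similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
print(similarities)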
Here’s how you can implement a basic semantic search:
from kerb.embedding import embed, embed_batch, batch_similarity
# 1. Your collection of document chunks
documents = [
    "Python is a high-level programming language",
    "Machine learning models learn patterns from data",
    "Natural language processing helps computers understand text",
    "Deep neural networks have multiple layers",
    "Data science combines statistics and programming",
]
# 2. Pre-compute embeddings for all documents
doc_embeddings = embed_batch(documents)
# 3. Define a search query and generate its embedding
query = "I want to learn about AI and neural networks"
query_embedding = embed(query)
# 4. Calculate similarity scores for all documents against the query
similarities = batch_similarity(query_embedding, doc_embeddings, metric="cosine")
# 5. Combine documents with their scores and sort by relevance
results = sorted(zip(documents, similarities), key=lambda x: x[1], reverse=True)
# Display the top 3 results
print(f"Query: '{query}'\n")
print("Top 3 results:")
for i, (doc, score) in enumerate(results[:3], 1):
print(f"{i}. [{score:.4f}] {doc}")
Notice that the ranking is determined by closeness in embedding space rather than by exact keyword matches. The top result, "Deep neural networks have multiple layers," happens to share the phrase "neural networks" with the query, but documents that share no terms at all, such as "Machine learning models learn patterns from data," can still rank highly because their embeddings are close to the query embedding.
Sorting the entire list of similarities works well, but for applications that only require the top few results, there's a more direct approach. The top_k_similar function is optimized to find the k most similar vectors without needing to sort the entire collection. It returns a list of indices corresponding to the top-scoring documents.
This is particularly useful in RAG systems, where you typically only need the top 3-5 most relevant chunks to pass to the LLM.
from kerb.embedding import embed, embed_batch, top_k_similar, cosine_similarity
documents = [
"Python is a high-level programming language",
"Machine learning models learn patterns from data",
"Natural language processing helps computers understand text",
"Deep neural networks have multiple layers",
"Data science combines statistics and programming",
"Software engineering involves designing and building systems",
]
doc_embeddings = embed_batch(documents)
query = "programming languages and software development"
query_embedding = embed(query)
# Get the indices of the top 3 most similar documents
top_3_indices = top_k_similar(query_embedding, doc_embeddings, k=3)
print(f"Query: '{query}'\n")
print("Top 3 results using top_k_similar:")
for rank, idx in enumerate(top_3_indices, 1):
    # We can get the score by calculating similarity for just the top indices
    similarity = cosine_similarity(query_embedding, doc_embeddings[idx])
    print(f"{rank}. [{similarity:.4f}] {documents[idx]}")
Using top_k_similar is generally faster and more memory-efficient than batch_similarity followed by a full sort, making it the preferred method for retrieval in most production scenarios.
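If you want to check that claim on your own data, a rough comparison like the one below can help. This is a sketch rather than a benchmark: the synthetic corpus is illustrative, and the actual numbers depend entirely on your collection size and hardware.
import time
from kerb.embedding import embed, embed_batch, batch_similarity, top_k_similar

# Illustrative synthetic corpus; embedding it may take a while depending on the model.
corpus = [f"Synthetic document {i} about programming, data, and models" for i in range(1000)]
corpus_embeddings = embed_batch(corpus)
query_emb = embed("software development and programming")

# Approach 1: score everything, then sort the full list
start = time.perf_counter()
sims = batch_similarity(query_emb, corpus_embeddings, metric="cosine")
top_by_sort = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:5]
elapsed_sort = time.perf_counter() - start

# Approach 2: ask directly for the top 5 indices
start = time.perf_counter()
top_by_k = top_k_similar(query_emb, corpus_embeddings, k=5)
elapsed_topk = time.perf_counter() - start

print(f"batch_similarity + sort: {elapsed_sort * 1000:.2f} ms, indices {top_by_sort}")
print(f"top_k_similar:           {elapsed_topk * 1000:.2f} ms, indices {top_by_k}")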
With these components, you can encapsulate the logic into a simple, reusable search class. This is a foundational pattern for building more complex retrieval systems. The class will handle indexing the documents (pre-computing embeddings) and performing searches.
from kerb.embedding import embed, embed_batch, top_k_similar, cosine_similarity
class SimpleSearchEngine:
    """A basic semantic search engine."""

    def __init__(self, documents: list[str]):
        self.documents = documents
        print(f"Indexing {len(documents)} documents...")
        self.embeddings = embed_batch(documents)
        print("Indexing complete.")

    def search(self, query: str, top_k: int = 3) -> list[dict]:
        """Search for relevant documents."""
        query_emb = embed(query)
        top_indices = top_k_similar(query_emb, self.embeddings, k=top_k)
        results = []
        for idx in top_indices:
            sim = cosine_similarity(query_emb, self.embeddings[idx])
            results.append({
                'document': self.documents[idx],
                'score': sim,
                'index': idx
            })
        return results
# Document collection
documents = [
"Python is a high-level programming language",
"Machine learning models learn patterns from data",
"Natural language processing helps computers understand text",
"Artificial intelligence enables machines to think",
"Software engineering involves designing and building systems"
]
# Create and use the search engine
engine = SimpleSearchEngine(documents)
# Perform a search
search_query = "AI and computer thought"
search_results = engine.search(search_query, top_k=2)
print(f"\nSearch: '{search_query}'")
for i, result in enumerate(search_results, 1):
print(f" {i}. [{result['score']:.4f}] {result['document']}")
This simple class demonstrates the complete end-to-end flow: indexing documents by creating embeddings and then using those embeddings to find relevant information at query time.
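As noted earlier, in a RAG pipeline the retrieved chunks become context for the language model. The sketch below shows that hand-off using the SimpleSearchEngine defined above; the question and prompt template are illustrative placeholders, and the actual LLM call is left out.
# Reuses the 'engine' and 'documents' defined above.
question = "How do machines understand human language?"
retrieved = engine.search(question, top_k=3)

# Assemble the retrieved chunks into a context block for the prompt.
context = "\n".join(f"- {r['document']}" for r in retrieved)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)
print(prompt)
# The assembled prompt would then be sent to your LLM of choice.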
A common challenge in retrieval systems is handling queries that have no relevant documents in the knowledge base. In such cases, a semantic search will still return the least dissimilar documents, even if their similarity score is very low. This can lead to the LLM receiving irrelevant context and generating a poor or incorrect response.
To mitigate this, you can apply a similarity score threshold. If no documents meet the minimum score, you can conclude that there are no relevant results.
from kerb.embedding import embed, batch_similarity

# Assuming 'doc_embeddings' and 'documents' are already defined
query = "quantum computing" # A topic not present in our small collection
query_embedding = embed(query)
# Calculate similarities for all documents
similarities = batch_similarity(query_embedding, doc_embeddings)
# Filter results based on a threshold
threshold = 0.3
relevant_results = [
    (doc, sim) for doc, sim in zip(documents, similarities) if sim > threshold
]
print(f"Query: '{query}'")
print(f"Results with similarity > {threshold}:")
if relevant_results:
    for doc, sim in sorted(relevant_results, key=lambda x: x[1], reverse=True):
        print(f" [{sim:.4f}] {doc}")
else:
    print(f" No results found above threshold {threshold}")
    best_match_score = max(similarities)
    best_match_doc = documents[similarities.index(best_match_score)]
    print(f" (Best match below threshold: [{best_match_score:.4f}] {best_match_doc})")
Choosing an appropriate threshold is use-case specific and often requires experimentation with your dataset. A good starting point is typically between 0.25 and 0.4. By filtering out low-quality results, you ensure that the context passed to the language model is both relevant and useful, which is fundamental to the performance of any RAG system.
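One way to run that experiment is to collect a handful of queries for which you know whether relevant documents exist, then see how each candidate threshold behaves. The sketch below is illustrative: the labeled queries are invented for demonstration, and it reuses documents and doc_embeddings from the previous snippet.
from kerb.embedding import embed, batch_similarity

# Hypothetical labeled queries: (query text, whether the collection contains relevant docs)
labeled_queries = [
    ("neural network architectures", True),
    ("quantum computing hardware", False),
]

for threshold in (0.25, 0.30, 0.35, 0.40):
    correct = 0
    for text, has_relevant in labeled_queries:
        sims = batch_similarity(embed(text), doc_embeddings)
        found_any = any(sim > threshold for sim in sims)
        correct += (found_any == has_relevant)
    print(f"threshold={threshold:.2f}: {correct}/{len(labeled_queries)} queries behave as expected")
Whatever threshold you settle on, revisit it whenever the document collection or the embedding model changes, since both shift the distribution of similarity scores.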