Semantic search is a core component of retrieval systems. This approach employs text embeddings and vector similarity to find information based on the meaning and intent behind a query. Unlike traditional keyword-based search, which relies on exact term matches, semantic search allows you to find relevant documents even if they don't contain the exact words used in the search query.
The process is a direct application of the tools you've just learned. To find the most relevant documents for a given query, you embed every document in your collection, embed the query itself, compute the similarity between the query embedding and each document embedding, and then rank the documents by their scores.
While you could loop through each document embedding and calculate its similarity to the query embedding one by one, this approach is inefficient, especially with a large number of documents. A more performant method is to compute all similarities in a single, optimized operation.
The batch_similarity function is designed for this purpose. It takes a single query vector and a list of document vectors, then efficiently calculates the similarity between the query and every document.
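For comparison, the one-document-at-a-time approach described above would look roughly like the sketch below. It is shown only for contrast and uses cosine_similarity, which is introduced properly later in this section.
from kerb.embedding import embed, embed_batch, cosine_similarity

documents = ["Python is a programming language", "Cats make popular pets"]
doc_embeddings = embed_batch(documents)
query_embedding = embed("software development")

# One similarity call per document: simple, but slow for large collections.
similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
print(similarities)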
Here’s how you can implement a basic semantic search:
from kerb.embedding import embed, embed_batch, batch_similarity
# 1. Your collection of document chunks
documents = [
    "Python is a high-level programming language",
    "Machine learning models learn patterns from data",
    "Natural language processing helps computers understand text",
    "Deep neural networks have multiple layers",
    "Data science combines statistics and programming",
]
# 2. Pre-compute embeddings for all documents
doc_embeddings = embed_batch(documents)
# 3. Define a search query and generate its embedding
query = "I want to learn about AI and neural networks"
query_embedding = embed(query)
# 4. Calculate similarity scores for all documents against the query
similarities = batch_similarity(query_embedding, doc_embeddings, metric="cosine")
# 5. Combine documents with their scores and sort by relevance
results = sorted(zip(documents, similarities), key=lambda x: x[1], reverse=True)
# Display the top 3 results
print(f"Query: '{query}'\n")
print("Top 3 results:")
for i, (doc, score) in enumerate(results[:3], 1):
print(f"{i}. [{score:.4f}] {doc}")
Notice that the ranking is determined by closeness in embedding space rather than by exact keyword matches. The top result, "Deep neural networks have multiple layers," happens to share the phrase "neural networks" with the query, but documents that share no terms at all, such as "Machine learning models learn patterns from data," can still rank highly because their embeddings are close to the query embedding.
Sorting the entire list of similarities works well, but for applications that only require the top few results, there's a more direct approach. The top_k_similar function is optimized to find the k most similar vectors without needing to sort the entire collection. It returns a list of indices corresponding to the top-scoring documents.
This is particularly useful in RAG systems, where you typically only need the top 3-5 most relevant chunks to pass to the LLM.
from kerb.embedding import embed, embed_batch, top_k_similar, cosine_similarity
documents = [
"Python is a high-level programming language",
"Machine learning models learn patterns from data",
"Natural language processing helps computers understand text",
"Deep neural networks have multiple layers",
"Data science combines statistics and programming",
"Software engineering involves designing and building systems",
]
doc_embeddings = embed_batch(documents)
query = "programming languages and software development"
query_embedding = embed(query)
# Get the indices of the top 3 most similar documents
top_3_indices = top_k_similar(query_embedding, doc_embeddings, k=3)
print(f"Query: '{query}'\n")
print("Top 3 results using top_k_similar:")
for rank, idx in enumerate(top_3_indices, 1):
    # We can get the score by calculating similarity for just the top indices
    similarity = cosine_similarity(query_embedding, doc_embeddings[idx])
    print(f"{rank}. [{similarity:.4f}] {documents[idx]}")
Using top_k_similar is generally faster and more memory-efficient than batch_similarity followed by a full sort, making it the preferred method for retrieval in most production scenarios.
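If you want to check that claim on your own data, a rough comparison like the one below can help. This is a sketch rather than a benchmark: the synthetic corpus is illustrative, and the actual numbers depend entirely on your collection size and hardware.
import time
from kerb.embedding import embed, embed_batch, batch_similarity, top_k_similar

# Illustrative synthetic corpus; embedding it may take a while depending on the model.
corpus = [f"Synthetic document {i} about programming, data, and models" for i in range(1000)]
corpus_embeddings = embed_batch(corpus)
query_emb = embed("software development and programming")

# Approach 1: score everything, then sort the full list
start = time.perf_counter()
sims = batch_similarity(query_emb, corpus_embeddings, metric="cosine")
top_by_sort = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:5]
elapsed_sort = time.perf_counter() - start

# Approach 2: ask directly for the top 5 indices
start = time.perf_counter()
top_by_k = top_k_similar(query_emb, corpus_embeddings, k=5)
elapsed_topk = time.perf_counter() - start

print(f"batch_similarity + sort: {elapsed_sort * 1000:.2f} ms, indices {top_by_sort}")
print(f"top_k_similar:           {elapsed_topk * 1000:.2f} ms, indices {top_by_k}")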
With these components, you can encapsulate the logic into a simple, reusable search class. This is a foundational pattern for building more complex retrieval systems. The class will handle indexing the documents (pre-computing embeddings) and performing searches.
from kerb.embedding import embed, embed_batch, top_k_similar, cosine_similarity
class SimpleSearchEngine:
    """A basic semantic search engine."""

    def __init__(self, documents: list[str]):
        self.documents = documents
        print(f"Indexing {len(documents)} documents...")
        self.embeddings = embed_batch(documents)
        print("Indexing complete.")

    def search(self, query: str, top_k: int = 3) -> list[dict]:
        """Search for relevant documents."""
        query_emb = embed(query)
        top_indices = top_k_similar(query_emb, self.embeddings, k=top_k)
        results = []
        for idx in top_indices:
            sim = cosine_similarity(query_emb, self.embeddings[idx])
            results.append({
                'document': self.documents[idx],
                'score': sim,
                'index': idx
            })
        return results
# Document collection
documents = [
"Python is a high-level programming language",
"Machine learning models learn patterns from data",
"Natural language processing helps computers understand text",
"Artificial intelligence enables machines to think",
"Software engineering involves designing and building systems"
]
# Create and use the search engine
engine = SimpleSearchEngine(documents)
# Perform a search
search_query = "AI and computer thought"
search_results = engine.search(search_query, top_k=2)
print(f"\nSearch: '{search_query}'")
for i, result in enumerate(search_results, 1):
print(f" {i}. [{result['score']:.4f}] {result['document']}")
This simple class demonstrates the complete end-to-end flow: indexing documents by creating embeddings and then using those embeddings to find relevant information at query time.
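As noted earlier, in a RAG pipeline the retrieved chunks become context for the language model. The sketch below shows that hand-off using the SimpleSearchEngine defined above; the question and prompt template are illustrative placeholders, and the actual LLM call is left out.
# Reuses the 'engine' and 'documents' defined above.
question = "How do machines understand human language?"
retrieved = engine.search(question, top_k=3)

# Assemble the retrieved chunks into a context block for the prompt.
context = "\n".join(f"- {r['document']}" for r in retrieved)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)
print(prompt)
# The assembled prompt would then be sent to your LLM of choice.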
A common challenge in retrieval systems is handling queries that have no relevant documents in the knowledge base. In such cases, a semantic search will still return the least dissimilar documents, even if their similarity score is very low. This can lead to the LLM receiving irrelevant context and generating a poor or incorrect response.
To mitigate this, you can apply a similarity score threshold. If no documents meet the minimum score, you can conclude that there are no relevant results.
from kerb.embedding import embed, batch_similarity

# Assuming 'doc_embeddings' and 'documents' are already defined
query = "quantum computing" # A topic not present in our small collection
query_embedding = embed(query)
# Calculate similarities for all documents
similarities = batch_similarity(query_embedding, doc_embeddings)
# Filter results based on a threshold
threshold = 0.3
relevant_results = [
    (doc, sim) for doc, sim in zip(documents, similarities) if sim > threshold
]
print(f"Query: '{query}'")
print(f"Results with similarity > {threshold}:")
if relevant_results:
    for doc, sim in sorted(relevant_results, key=lambda x: x[1], reverse=True):
        print(f" [{sim:.4f}] {doc}")
else:
    print(f" No results found above threshold {threshold}")
    best_match_score = max(similarities)
    best_match_doc = documents[similarities.index(best_match_score)]
    print(f" (Best match below threshold: [{best_match_score:.4f}] {best_match_doc})")
Choosing an appropriate threshold is use-case specific and often requires experimentation with your dataset. A good starting point is typically between 0.25 and 0.4. By filtering out low-quality results, you ensure that the context passed to the language model is both relevant and useful, which is fundamental to the performance of any RAG system.
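One way to run that experiment is to collect a handful of queries for which you know whether relevant documents exist, then see how each candidate threshold behaves. The sketch below is illustrative: the labeled queries are invented for demonstration, and it reuses documents and doc_embeddings from the previous snippet.
from kerb.embedding import embed, batch_similarity

# Hypothetical labeled queries: (query text, whether the collection contains relevant docs)
labeled_queries = [
    ("neural network architectures", True),
    ("quantum computing hardware", False),
]

for threshold in (0.25, 0.30, 0.35, 0.40):
    correct = 0
    for text, has_relevant in labeled_queries:
        sims = batch_similarity(embed(text), doc_embeddings)
        found_any = any(sim > threshold for sim in sims)
        correct += (found_any == has_relevant)
    print(f"threshold={threshold:.2f}: {correct}/{len(labeled_queries)} queries behave as expected")
Whatever threshold you settle on, revisit it whenever the document collection or the embedding model changes, since both shift the distribution of similarity scores.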