Let's bring together the concepts we've covered by building a small, functional semantic search application. This exercise integrates an embedding model, a vector database client, and a simple web framework, demonstrating a complete, albeit basic, search pipeline from user query to relevant results. We'll use components that are straightforward to set up locally, allowing you to focus on the interaction between them.
For this practical, we will use:

- The sentence-transformers library with a pre-trained model like all-MiniLM-L6-v2. This model is efficient and provides good-quality embeddings for sentences and short paragraphs.
- ChromaDB. We'll use its Python client for local, persistent storage, which simplifies setup for this example.
- FastAPI. A modern Python web framework that's easy to use and automatically generates interactive API documentation.

First, ensure you have the necessary libraries installed. You can install them using pip:
pip install sentence-transformers chromadb fastapi uvicorn[standard] python-multipart Jinja2
Each package plays a specific role:

- sentence-transformers: for loading the embedding model and generating vectors.
- chromadb: the client library for interacting with the Chroma vector database.
- fastapi: the web framework for creating our API endpoint.
- uvicorn: an ASGI server to run our FastAPI application.
- python-multipart: required by FastAPI for handling form data (though we use JSON and query parameters here).
- Jinja2: used by FastAPI for optional HTML templating if needed; often included with FastAPI's dependencies.

Let's create a script (index_data.py) to prepare some sample data, generate embeddings, and index them into ChromaDB.
# index_data.py
import chromadb
from chromadb.utils import embedding_functions

# --- Configuration ---
MODEL_NAME = 'all-MiniLM-L6-v2'
COLLECTION_NAME = "docs_collection"
PERSIST_DIRECTORY = "./chroma_db_persist"  # Directory to store DB data

# --- Sample Data ---
# Simple list of documents (sentences in this case)
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is transforming many industries.",
    "Vector databases are optimized for similarity search.",
    "Natural language processing enables computers to understand text.",
    "The capital of France is Paris.",
    "Apples are a type of fruit, often red or green.",
    "Machine learning algorithms learn from data.",
    "Semantic search provides results based on meaning, not just keywords.",
]

# --- Initialization ---
print("Initializing ChromaDB client...")
# Initialize the ChromaDB client with persistence
# This will save the database state to the specified directory
client = chromadb.PersistentClient(path=PERSIST_DIRECTORY)

print(f"Getting or creating collection: {COLLECTION_NAME}")
# The embedding function wraps our SentenceTransformer model. It downloads
# the model weights automatically the first time it runs. The model maps
# sentences and short paragraphs to a 384-dimensional dense vector space.
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=MODEL_NAME)

# Get or create the collection. If it exists, it will be loaded.
collection = client.get_or_create_collection(
    name=COLLECTION_NAME,
    embedding_function=embedding_fn,
    # Chroma defaults to L2 (squared Euclidean) distance. Pass
    # metadata={"hnsw:space": "cosine"} here if you prefer cosine distance.
)

# --- Indexing ---
print("Generating IDs and preparing data for indexing...")
# Generate simple sequential IDs for this example
doc_ids = [f"doc_{i}" for i in range(len(documents))]

# Check if data needs indexing (simple check based on expected count)
# In a real app, you might have a more robust way to track indexed data
if collection.count() < len(documents):
    print(f"Indexing {len(documents)} documents...")
    try:
        # Add documents to the collection. The embedding function attached
        # to the collection generates the vectors automatically.
        collection.add(
            documents=documents,
            ids=doc_ids
        )
        print("Documents indexed successfully.")
    except Exception as e:
        print(f"Error indexing documents: {e}")
else:
    print("Documents seem to be already indexed.")

print(f"Collection '{COLLECTION_NAME}' now contains {collection.count()} documents.")
print("Indexing script finished.")
Explanation:

- We define the sample documents and configuration parameters.
- We create a PersistentClient for ChromaDB, specifying a directory (./chroma_db_persist) where the database files will be stored. This makes our index persistent across runs.
- We use client.get_or_create_collection, which is convenient because it either creates the collection if it doesn't exist or loads the existing one if it does. We attach a SentenceTransformerEmbeddingFunction to the collection via embedding_function; the first time you run the script, it downloads the model weights. ChromaDB uses this function automatically when we add documents or perform queries.
- We generate simple sequential IDs (doc_ids) for each document.
- We add documents with collection.add. Because we configured an embedding_function, ChromaDB calls the model internally to get the vectors for each document before storing them. A basic count check avoids re-indexing on every run.

Run this script once to populate your local ChromaDB:
python index_data.py
You should see output indicating initialization and successful indexing, and a chroma_db_persist directory will be created.
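To sanity-check the index before building the API, you can query the collection directly from a Python shell. The snippet below is a minimal sketch; passing query_texts lets the collection's attached embedding function embed the query for you:

# verify_index.py -- optional sanity check of the persisted collection
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma_db_persist")
collection = client.get_or_create_collection(
    name="docs_collection",
    embedding_function=embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name='all-MiniLM-L6-v2'
    ),
)

# The embedding function embeds the query text before the similarity search
results = collection.query(query_texts=["similarity search"], n_results=2)
print(results["documents"][0])  # expect the vector database sentence near the top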
Now, let's create the web application (main.py) that will serve our search requests.
# main.py
import chromadb
from fastapi import FastAPI, Query, HTTPException
from sentence_transformers import SentenceTransformer
import uvicorn  # For running the app

# --- Configuration ---
MODEL_NAME = 'all-MiniLM-L6-v2'
COLLECTION_NAME = "docs_collection"
PERSIST_DIRECTORY = "./chroma_db_persist"
N_RESULTS = 3  # Number of search results to return

# --- Application Initialization ---
app = FastAPI(
    title="Simple Semantic Search API",
    description="An API that uses a vector database for semantic search.",
    version="0.1.0"
)

# --- Global Variables / Resources ---
# Initialize resources once when the application starts
try:
    print("Loading embedding model...")
    embedding_model = SentenceTransformer(MODEL_NAME)
    print("Model loaded successfully.")

    print("Connecting to ChromaDB...")
    db_client = chromadb.PersistentClient(path=PERSIST_DIRECTORY)
    collection = db_client.get_collection(name=COLLECTION_NAME)
    # Verify the collection has items (optional but good practice)
    if collection.count() == 0:
        print(f"Warning: Collection '{COLLECTION_NAME}' is empty. Did you run index_data.py?")
    print("ChromaDB connection successful.")
except Exception as e:
    print(f"Error during initialization: {e}")
    # Handle initialization failure appropriately, maybe exit or raise a specific error
    embedding_model = None
    collection = None

# --- API Endpoints ---
@app.get("/search/")
async def perform_search(
    query: str = Query(..., min_length=3, description="The search query text.")
):
    """
    Performs semantic search on the indexed documents.

    Takes a query string, generates its embedding, and searches the vector
    database for the most similar documents.
    """
    if not embedding_model or not collection:
        raise HTTPException(status_code=503, detail="Search service is not available due to initialization error.")

    print(f"Received query: '{query}'")
    try:
        # 1. Generate an embedding for the query
        print("Generating query embedding...")
        query_embedding = embedding_model.encode(query).tolist()
        print("Query embedding generated.")

        # 2. Query the vector database
        print(f"Querying collection '{COLLECTION_NAME}'...")
        results = collection.query(
            query_embeddings=[query_embedding],  # query_embeddings expects a list of embeddings
            n_results=N_RESULTS,
            include=['documents', 'distances']  # Ask ChromaDB to return documents and distances
        )
        print("Query executed successfully.")

        # 3. Format and return the results
        # The results structure is nested per query; simplify it for the response
        if results and results.get('ids') and results['ids'][0]:
            formatted_results = []
            ids = results['ids'][0]
            distances = results['distances'][0]
            documents = results['documents'][0]
            for i in range(len(ids)):
                formatted_results.append({
                    "id": ids[i],
                    "document": documents[i],
                    "distance": distances[i]  # Lower distance means more similar for cosine/Euclidean
                })
            return {"results": formatted_results}
        else:
            return {"results": []}  # Return an empty list if no results were found

    except Exception as e:
        print(f"Error during search for query '{query}': {e}")
        raise HTTPException(status_code=500, detail=f"Search failed: {str(e)}")

@app.get("/")
async def read_root():
    """ A simple root endpoint to check if the API is running. """
    return {"message": "Semantic Search API is running. Use the /search/ endpoint."}

# --- Main Execution ---
# This block allows running the app directly using `python main.py`
if __name__ == "__main__":
    print("Starting FastAPI server...")
    uvicorn.run(app, host="0.0.0.0", port=8000)
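A side note on the initialization pattern above: a module-level try/except is fine for a demo, but FastAPI also provides a lifespan hook that runs setup before the server starts serving and teardown when it stops. A minimal sketch of that alternative (the resources dictionary and its keys are illustrative):

# lifespan_sketch.py -- alternative startup pattern using FastAPI's lifespan hook
from contextlib import asynccontextmanager

import chromadb
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

resources = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load heavyweight resources once, before the app starts serving requests
    resources["model"] = SentenceTransformer('all-MiniLM-L6-v2')
    client = chromadb.PersistentClient(path="./chroma_db_persist")
    resources["collection"] = client.get_collection(name="docs_collection")
    yield
    # Teardown would go here; this demo has nothing to release
    resources.clear()

app = FastAPI(lifespan=lifespan)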
Explanation:

- We load the SentenceTransformer model and connect to the persistent ChromaDB collection once when the application starts. This avoids reloading the model or reconnecting to the DB on every request, which would be very inefficient. Error handling is added for robustness.
- The /search/ endpoint accepts a query parameter (a string).
- We generate an embedding for the query using the same model we used for indexing. This matters: query and document vectors are only comparable if they come from the same embedding space.
- We call collection.query to find the N_RESULTS most similar document embeddings to the query_embedding. We request that ChromaDB include the original documents and distances in the response.
- The root / endpoint confirms the API is running.
- The if __name__ == "__main__": block allows you to run the server directly using python main.py. Alternatively, you can use uvicorn main:app --reload --host 0.0.0.0 --port 8000. The --reload flag is useful during development as it automatically restarts the server when you save changes.

To run the application:

1. Index the Data: If you haven't already, run python index_data.py.
2. Start the API Server: Run uvicorn main:app --reload --port 8000.
3. Test the API: Open your web browser or use a tool like curl to send requests to the search endpoint:
http://localhost:8000/search/?query=what+is+AI
curl "http://localhost:8000/search/?query=information%20about%20databases"
curl "http://localhost:8000/search/?query=tell%20me%20about%20animals"
You should receive JSON responses containing the most relevant documents from your small dataset based on semantic similarity, along with their distances. For example, querying about "databases" should return results related to vector databases and possibly machine learning. Querying about "animals" should retrieve the sentence about the fox.
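For reference, a response for the databases query has roughly this shape (the distance values are placeholders; the exact numbers and ordering depend on the model version and the collection's distance metric):

{
  "results": [
    {"id": "doc_2", "document": "Vector databases are optimized for similarity search.", "distance": 0.83},
    {"id": "doc_6", "document": "Machine learning algorithms learn from data.", "distance": 1.12},
    {"id": "doc_3", "document": "Natural language processing enables computers to understand text.", "distance": 1.19}
  ]
}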
To summarize the request flow: a user sends a query to the API endpoint, which uses the embedding model to convert the query into a vector. That vector is used to search the ChromaDB collection for similar document vectors, and the matching results are formatted and returned to the user.
This example provides a basic structure. You could extend it in many ways, for instance by adding metadata filtering (using where clauses in the query method) to refine search results; a sketch of that idea closes this section.
This hands-on exercise demonstrates how the components discussed throughout this course (embedding models, vector databases, and search logic) come together to create applications that understand the meaning behind user queries.
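As a concrete illustration of the filtering extension mentioned above, the sketch below stores a topic field as metadata and restricts a query to it. The "topic" field, its values, and the collection name are invented for illustration; the metadatas and where parameters themselves are part of ChromaDB's API.

# filter_example.py -- sketch of metadata filtering; the "topic" field is hypothetical
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma_db_persist")
collection = client.get_or_create_collection(
    name="docs_with_metadata",  # a separate demo collection
    embedding_function=embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name='all-MiniLM-L6-v2'
    ),
)

collection.add(
    documents=[
        "Vector databases are optimized for similarity search.",
        "Apples are a type of fruit, often red or green.",
    ],
    metadatas=[{"topic": "tech"}, {"topic": "food"}],  # one metadata dict per document
    ids=["meta_0", "meta_1"],
)

# Only documents whose metadata matches the where clause are considered
results = collection.query(
    query_texts=["database systems"],
    n_results=1,
    where={"topic": "tech"},
)
print(results["documents"][0])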