Weaviate stands out among vector databases for its modular architecture and a GraphQL-based API that integrates vector search with structured data filtering. Its flexibility allows you to plug in different vector indexing algorithms and embedding models, or even bring your own pre-computed vectors. This section guides you through using the Weaviate Python client to interact with a Weaviate instance, perform core operations, and build search functionality.
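First, ensure you have the Weaviate Python client installed. The examples in this section use the v3 client API (weaviate.Client), which the v4 client deprecates and later releases remove, so pin the version accordingly:
pip install "weaviate-client<4"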
To interact with Weaviate, you need to establish a connection to your instance. This could be a local instance running via Docker or a managed instance from Weaviate Cloud Services (WCS).
import weaviate
import os
# Example for connecting to a local Weaviate instance
# client = weaviate.Client("http://localhost:8080")
# Example for connecting to Weaviate Cloud Services (WCS)
# Replace with your WCS URL and API key
wcs_url = os.getenv("WEAVIATE_URL") # Or replace with your URL string
api_key = os.getenv("WEAVIATE_API_KEY") # Or replace with your API key string
client = weaviate.Client(
    url=wcs_url,
    auth_client_secret=weaviate.AuthApiKey(api_key=api_key),
    # Optional: Specify OpenAI API key if using text2vec-openai module
    # headers={
    #     "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")
    # }
)
# Check connection
if client.is_ready():
    print("Successfully connected to Weaviate!")
else:
    print("Failed to connect to Weaviate.")
Remember to replace the placeholder connection details with your actual URL and API key, possibly using environment variables for security as shown above.
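As a small sketch of the environment-variable approach, the helper below fails fast when a setting is missing instead of passing None to the client. The function name load_weaviate_settings is hypothetical (it is not part of the Weaviate client); only the two variable names come from the example above.

```python
import os


def load_weaviate_settings(env=os.environ):
    """Return (url, api_key) read from the environment.

    Hypothetical helper: raises early if either variable is unset or empty,
    so a misconfigured shell does not surface later as a confusing client error.
    """
    url = env.get("WEAVIATE_URL")
    api_key = env.get("WEAVIATE_API_KEY")
    missing = [
        name
        for name, value in (("WEAVIATE_URL", url), ("WEAVIATE_API_KEY", api_key))
        if not value
    ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return url, api_key
```

The returned pair can then be passed to the client constructor in place of the raw os.getenv calls.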
Before adding data, you need to define a schema. In Weaviate, a schema consists of "Classes" (analogous to tables or collections) which have "Properties" (fields). Each class definition includes how its objects should be vectorized.
Let's define a simple Document class with title and content properties. We'll configure it to use a pre-built transformer model (like Sentence-BERT, provided by the text2vec-transformers module, assuming it's enabled in your Weaviate instance) to automatically generate vectors for the content property.
# Delete the class if it already exists so the example can be re-run
if client.schema.exists("Document"):
    client.schema.delete_class("Document")
# To wipe everything instead (use with caution, deletes ALL classes and their data):
# client.schema.delete_all()
document_class_schema = {
    "class": "Document",
    "description": "A class to store text documents",
    "vectorizer": "text2vec-transformers",  # Specify the vectorizer module
    "moduleConfig": {
        "text2vec-transformers": {
            "poolingStrategy": "masked_mean",
            "vectorizeClassName": False  # Usually set to False
        }
    },
    "properties": [
        {
            "name": "title",
            "dataType": ["text"],
            "description": "The title of the document",
        },
        {
            "name": "content",
            "dataType": ["text"],
            "description": "The main content of the document",
            "moduleConfig": {  # Property-specific module config
                "text2vec-transformers": {
                    "skip": False,  # Ensure this property is vectorized
                    "vectorizePropertyName": False  # Usually set to False
                }
            }
        },
        {
            "name": "word_count",
            "dataType": ["int"],
            "description": "The number of words in the content",
        }
    ]
}
# Create the class in Weaviate
try:
    client.schema.create_class(document_class_schema)
    print("Schema 'Document' created successfully.")
except Exception as e:
    print(f"Error creating schema: {e}")
# You can verify the schema creation
# schema = client.schema.get("Document")
# print(schema)
Here, we specified text2vec-transformers as the vectorizer for the class. This means Weaviate will automatically use the configured transformer model within that module to generate a vector embedding for each Document object, primarily based on the content property as configured. If you wanted to provide your own pre-computed vectors, you would typically set the class vectorizer to none and include a vector when adding data.
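The bring-your-own-vector setup can be sketched as follows. The class name ExternalDocument, the helper build_byov_class, and the three-element vector are illustrative assumptions; a real vector must have the dimensionality of whatever embedding model you use. The client calls are left as comments because they need a running instance.

```python
def build_byov_class(class_name, property_names):
    """Build a class definition with vectorization disabled.

    Hypothetical helper: with "vectorizer" set to "none", Weaviate does not
    embed anything itself and expects the caller to supply each vector.
    """
    return {
        "class": class_name,
        "vectorizer": "none",
        "properties": [
            {"name": name, "dataType": ["text"]} for name in property_names
        ],
    }


external_class = build_byov_class("ExternalDocument", ["title", "content"])
example_vector = [0.12, -0.03, 0.88]  # Placeholder values, not a real embedding

# With a live v3 client, the definition and an object would be sent like this:
# client.schema.create_class(external_class)
# client.data_object.create(
#     data_object={"title": "My title", "content": "My text"},
#     class_name="ExternalDocument",
#     vector=example_vector,
# )
```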
With the schema defined, you can add, retrieve, update, and delete data objects.
Adding data involves creating objects that conform to the defined class schema. Providing data for properties specified in the moduleConfig (like content in our example) allows Weaviate to automatically generate the vector.
For efficiency, especially with larger datasets, use batching.
# Configure batch processing
client.batch.configure(
    batch_size=100,  # Number of objects per batch
    dynamic=True,  # Dynamically adjust batch size
    timeout_retries=3,  # Number of timeout retries
)
documents_to_add = [
    {"title": "Introduction to Vector Databases", "content": "Vector databases are specialized systems designed to store and query high-dimensional vector embeddings.", "word_count": 13},
    {"title": "Understanding ANN Search", "content": "Approximate Nearest Neighbor (ANN) search algorithms trade some accuracy for significant speed improvements in high dimensions.", "word_count": 16},
    {"title": "Weaviate Client Library", "content": "The Weaviate Python client provides methods for schema management, data manipulation, and querying.", "word_count": 13},
    {"title": "Semantic Search Concepts", "content": "Semantic search aims to understand the intent and contextual meaning behind user queries, going beyond keyword matching.", "word_count": 17}
]
print("Adding documents using batch processing...")
with client.batch as batch:
    for i, doc_data in enumerate(documents_to_add):
        # Add object to the batch
        batch.add_data_object(
            data_object=doc_data,
            class_name="Document",
            # Optionally specify a UUID, otherwise Weaviate generates one
            # uuid=generate_uuid5(doc_data['title'])  # Example using generate_uuid5 from weaviate.util
        )
        if (i + 1) % 10 == 0:  # Optional: Print progress
            print(f"Added {i+1} documents to batch...")
# The batch is automatically processed when exiting the 'with' block.
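# The v3 client reports per-object batch errors through the callback passed to
# client.batch.configure(), so verify the import by counting the stored objects.
count_result = client.query.aggregate("Document").with_meta_count().do()
stored_count = count_result["data"]["Aggregate"]["Document"][0]["meta"]["count"]
if stored_count < len(documents_to_add):
    print(f"Expected {len(documents_to_add)} objects but found {stored_count}.")
else:
    print(f"Successfully added {len(documents_to_add)} documents.")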
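The batch call above accepts an optional uuid. One reason to supply it is idempotent imports: if the ID is derived from the data, re-running the import targets the same objects rather than creating duplicates. The sketch below uses the standard library's uuid5 as a stand-in for the client's generate_uuid5 helper; the function name stable_uuid and the "class:title" key format are assumptions for illustration, and the resulting IDs will not necessarily match those produced by the client's helper.

```python
import uuid


def stable_uuid(class_name, title):
    """Derive a deterministic UUID from the class name and title.

    Hypothetical helper: the same inputs always yield the same ID,
    so repeated imports of one document map to one object.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, f"{class_name}:{title}"))


first = stable_uuid("Document", "Understanding ANN Search")
second = stable_uuid("Document", "Understanding ANN Search")
other = stable_uuid("Document", "Semantic Search Concepts")
```

This only works if the chosen key (here the title) is actually unique within the class.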
You can fetch objects by their Weaviate-assigned UUID or a custom UUID you provided.
# First, get the UUID of an object (e.g., the first one returned by a query).
# The id is only returned when it is requested explicitly via _additional.
response = (
    client.query
    .get("Document", ["title"])
    .with_additional(["id"])
    .with_limit(1)
    .do()
)
if response['data']['Get']['Document']:
    object_uuid = response['data']['Get']['Document'][0]['_additional']['id']
    print(f"Fetching object with UUID: {object_uuid}")
    # Fetch the object by UUID
    fetched_object = client.data_object.get_by_id(
        object_uuid,
        class_name="Document"
    )
    print("\nFetched Object:")
    print(fetched_object)
else:
    print("No documents found to fetch.")
# Fetch multiple objects (up to a limit) without specifying UUIDs
all_objects = client.data_object.get(class_name="Document", limit=10)
# print("\nAll Objects (limit 10):")
# print(all_objects)
You can update properties of an existing object using its UUID.
# Assuming we have object_uuid from the previous step
if 'object_uuid' in locals():
    new_title = "Weaviate Python Client Guide"
    print(f"\nUpdating object {object_uuid} with new title: '{new_title}'")
    # Option 1: replace() - Overwrites the whole object; properties not listed are dropped
    # client.data_object.replace(
    #     uuid=object_uuid,
    #     class_name="Document",
    #     data_object={"title": new_title}  # content and word_count would be lost
    # )
    # Option 2: update() - Merges the specified properties, keeping others intact (safer)
    client.data_object.update(
        uuid=object_uuid,
        class_name="Document",
        data_object={"title": new_title},
        # consistency_level=weaviate.ConsistencyLevel.ONE  # Optional consistency setting
    )
    # Verify the update
    updated_object = client.data_object.get_by_id(object_uuid, class_name="Document")
    print("\nVerified Updated Object:")
    print(updated_object)
Important: When using update, only the properties included in the data_object dictionary are modified; properties not included remain unchanged. To overwrite the whole object instead, the v3 client provides replace. Earlier client versions exposed a distinct merge method, so check the client documentation for the exact behavior in your version.
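The difference between a partial update and a full replacement can be modeled with plain dictionaries. This is only an illustration of the semantics, not how Weaviate implements them; the function names merge_properties and replace_properties are hypothetical.

```python
def merge_properties(stored, changes):
    """Partial update: keys in `changes` overwrite, all other keys survive."""
    return {**stored, **changes}


def replace_properties(stored, new_object):
    """Full replacement: the stored object is discarded entirely."""
    return dict(new_object)


stored = {"title": "Weaviate Client Library", "content": "Some text", "word_count": 13}
changes = {"title": "Weaviate Python Client Guide"}

merged = merge_properties(stored, changes)      # content and word_count are kept
replaced = replace_properties(stored, changes)  # only title remains
```

If you intend to change a single field, the merge behavior is almost always what you want.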
Objects are deleted using their UUID.
# Assuming we have object_uuid from previous steps
if 'object_uuid' in locals():
    print(f"\nDeleting object with UUID: {object_uuid}")
    try:
        client.data_object.delete(
            uuid=object_uuid,
            class_name="Document",
            # consistency_level=weaviate.ConsistencyLevel.ALL  # Optional: Wait for all nodes
        )
        print(f"Object {object_uuid} deleted successfully.")
        # Verify deletion
        deleted_object = client.data_object.get_by_id(object_uuid, class_name="Document")
        if deleted_object is None:
            print("Verification: Object not found (deletion successful).")
        else:
            print("Verification Error: Object still found.")
    except weaviate.exceptions.UnexpectedStatusCodeException as e:
        if e.status_code == 404:
            print(f"Object {object_uuid} already deleted or not found.")
        else:
            print(f"Error deleting object: {e}")
Weaviate's querying capabilities are powerful, allowing semantic search, keyword search, filtering, and combinations thereof. The Python client provides a fluent interface that builds GraphQL queries behind the scenes.
The core of semantic search in Weaviate relies on nearText or nearVector.
nearText: Provide raw text. Weaviate uses the configured class vectorizer (e.g., text2vec-transformers) to embed the query text and find similar objects.
nearVector: Provide an explicit query vector. This is useful if you generate query embeddings outside of Weaviate.
search_query = "algorithms for fast vector lookup"
print(f"\nPerforming semantic search for: '{search_query}'")
# Using nearText (Weaviate embeds the query)
response = (
    client.query
    .get("Document", ["title", "content", "_additional {certainty distance id}"])  # Specify properties to return and additional info
    .with_near_text({"concepts": [search_query]})
    .with_limit(3)  # Limit the number of results
    .do()
)
print("\nSemantic Search Results (nearText):")
import json
print(json.dumps(response, indent=2))
# Example using nearVector (if you have a query vector)
# Assume 'query_vector' is a list or numpy array representing the embedding
# query_vector = [...]
# response_vector = (
#     client.query
#     .get("Document", ["title", "content", "_additional {certainty distance id}"])
#     .with_near_vector({"vector": query_vector})
#     .with_limit(3)
#     .do()
# )
# print("\nSemantic Search Results (nearVector):")
# print(json.dumps(response_vector, indent=2))
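The raw response is a nested dictionary mirroring the GraphQL result, so a small accessor keeps the calling code tidy. The helper name extract_hits is hypothetical, and the sample response below is hand-written to show the shape only; its numbers and ID are not real query output.

```python
def extract_hits(response, class_name):
    """Return the list of result objects for `class_name`.

    Hypothetical helper: raises if the response carries GraphQL errors,
    and returns an empty list when nothing matched.
    """
    if "errors" in response:
        raise RuntimeError(f"Query failed: {response['errors']}")
    return response.get("data", {}).get("Get", {}).get(class_name) or []


# Hand-written sample in the shape returned by .do(); values are illustrative.
sample_response = {
    "data": {
        "Get": {
            "Document": [
                {
                    "title": "Understanding ANN Search",
                    "content": "Approximate Nearest Neighbor (ANN) search ...",
                    "_additional": {"certainty": 0.9, "distance": 0.2, "id": "example-id"},
                }
            ]
        }
    }
}

hits = extract_hits(sample_response, "Document")
titles = [hit["title"] for hit in hits]
```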
The _additional field allows retrieving metadata about the search results, such as certainty (Weaviate's normalized similarity score, higher is better; available when the class uses cosine distance), distance (the raw vector distance, lower is better), and the object's id (UUID).
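For cosine distance, Weaviate documents certainty as a rescaling of distance: certainty = 1 - distance / 2. The sketch below computes both for small hand-made vectors so the relationship is concrete; the function names are illustrative, and the vectors are assumed to be non-zero and of equal length.

```python
import math


def cosine_distance(a, b):
    """Cosine distance in [0, 2]: 0 for same direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def certainty_from_cosine_distance(distance):
    """Map cosine distance [0, 2] onto certainty [0, 1]."""
    return 1.0 - distance / 2.0


same = certainty_from_cosine_distance(cosine_distance([1.0, 0.0], [2.0, 0.0]))
orthogonal = certainty_from_cosine_distance(cosine_distance([1.0, 0.0], [0.0, 1.0]))
opposite = certainty_from_cosine_distance(cosine_distance([1.0, 0.0], [-1.0, 0.0]))
```

Identical directions give a certainty near 1, orthogonal vectors 0.5, and opposite directions 0.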
You can combine semantic search (or other queries) with filtering based on object properties using the where filter. This allows refining search results based on metadata.
search_query = "database systems"
min_word_count = 15
print(f"\nPerforming filtered semantic search for: '{search_query}' with word_count > {min_word_count}")
response = (
    client.query
    .get("Document", ["title", "content", "word_count", "_additional {certainty distance}"])
    .with_near_text({"concepts": [search_query]})
    .with_where({
        "path": ["word_count"],  # Property to filter on
        "operator": "GreaterThan",  # Comparison operator
        "valueInt": min_word_count  # Value to compare against (note the type hint)
    })
    .with_limit(3)
    .do()
)
print("\nFiltered Semantic Search Results:")
print(json.dumps(response, indent=2))
The where filter supports various operators (Equal, NotEqual, GreaterThan, LessThan, Like for text pattern matching, etc.), and conditions can be nested using the And and Or operators to express complex conditions.
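A nested filter is just a dictionary whose operands are themselves filter dictionaries. The sketch below combines a numeric condition with a Like pattern; the helper where_and and the specific threshold and pattern are illustrative assumptions, and the resulting dictionary would be passed to with_where.

```python
def where_and(*conditions):
    """Combine filter dictionaries so that all of them must match."""
    return {"operator": "And", "operands": list(conditions)}


word_count_filter = {
    "path": ["word_count"],
    "operator": "GreaterThan",
    "valueInt": 15,
}
title_filter = {
    "path": ["title"],
    "operator": "Like",
    "valueText": "*Search*",  # '*' matches any run of characters
}

combined_filter = where_and(word_count_filter, title_filter)

# With a live v3 client:
# client.query.get("Document", ["title"]).with_where(combined_filter).do()
```

An Or combination has the same shape with "operator": "Or".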
Weaviate also supports hybrid search, combining semantic relevance (vector similarity) with traditional keyword relevance (like BM25).
search_query = "python client usage"
print(f"\nPerforming hybrid search for: '{search_query}'")
response = (
    client.query
    .get("Document", ["title", "content", "_additional {score explainScore id}"])  # Get hybrid score
    .with_hybrid(
        query=search_query,
        alpha=0.75,  # Weight for vector search (1.0 = pure vector, 0.0 = pure keyword)
        # properties=["content^2", "title"],  # Optional: Specify properties and weights for keyword search
    )
    .with_limit(3)
    .do()
)
print("\nHybrid Search Results:")
print(json.dumps(response, indent=2))
The alpha parameter controls the balance: alpha=1 is pure vector search, alpha=0 is pure keyword search (BM25), and values in between blend the two. The score returned under _additional reflects this combined ranking.
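To build intuition for alpha, here is a simplified weighted blend of two score sets after min-max normalization. This is a teaching sketch under stated assumptions, not Weaviate's actual fusion implementation; the function names, document IDs, and scores are all made up for illustration.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def blend_scores(vector_scores, keyword_scores, alpha):
    """Weighted sum: alpha weights the vector side, (1 - alpha) the keyword side."""
    v = min_max_normalize(vector_scores)
    k = min_max_normalize(keyword_scores)
    doc_ids = set(v) | set(k)
    return {
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * k.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


vector_scores = {"a": 0.9, "b": 0.5, "c": 0.1}   # e.g. similarity, higher is better
keyword_scores = {"a": 1.0, "b": 4.0, "c": 2.5}  # e.g. BM25, higher is better

blended = blend_scores(vector_scores, keyword_scores, alpha=0.75)
best_vector_only = max(blend_scores(vector_scores, keyword_scores, 1.0).items(), key=lambda kv: kv[1])[0]
best_keyword_only = max(blend_scores(vector_scores, keyword_scores, 0.0).items(), key=lambda kv: kv[1])[0]
```

With these made-up inputs, the top document changes from "a" to "b" as alpha moves from 1 toward 0, which is the kind of shift to look for when tuning alpha on real queries.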
Working with the Weaviate client involves defining schemas that match your data and use cases, leveraging batch operations for efficient data ingestion, and utilizing its flexible query language to combine semantic search with metadata filtering or hybrid approaches. This provides a solid foundation for building sophisticated search applications tailored to understanding meaning within your data.
© 2025 ApX Machine Learning