Text embeddings let computers understand and compare document content based on meaning. An embedding is a numerical representation of text, a vector that captures its semantic essence, allowing a system to find document chunks that are thematically related to a user's query even when they share no keywords.
Kerb's embedding module provides a unified interface for generating these embeddings using various models, from local, dependency-free options for testing to high-performance models from providers like OpenAI.
The most direct way to generate an embedding is with the embed function. It takes a string of text and returns a vector, which is represented as a list of floating-point numbers.
Let's generate an embedding for a simple sentence and inspect its properties:
from kerb.embedding import embed, embedding_dimension, vector_magnitude
text = "Machine learning transforms data into insights"
embedding = embed(text)
print(f"Text: '{text}'")
print(f"Embedding dimension: {embedding_dimension(embedding)}")
print(f"Vector magnitude: {vector_magnitude(embedding):.6f}")
print(f"First 5 values: {[round(v, 4) for v in embedding[:5]]}")
This will produce an output similar to the following:
Text: 'Machine learning transforms data into insights'
Embedding dimension: 384
Vector magnitude: 1.000000
First 5 values: [0.034, -0.0121, 0.0589, 0.0076, -0.045]
By default, the embed function uses a local, hash-based method that requires no external dependencies or API calls. This is useful for prototyping and testing because it's fast and deterministic. However, it does not produce semantically meaningful vectors. For genuine semantic search, you'll need to use a more sophisticated model.
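Because the default embedder is deterministic, you can verify this property directly. The check below is a minimal sketch that relies only on the dependency-free local embedder:

from kerb.embedding import embed

# The default hash-based embedder is deterministic:
# the same input text always produces the same vector.
text = "Machine learning transforms data into insights"
assert embed(text) == embed(text)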
The embed function supports various models through its model parameter. You can specify a model using the EmbeddingModel enum, which provides a convenient, type-safe way to select from well-known local and API-based models.
For many applications, running an embedding model locally is a great option. It offers a balance of high-quality embeddings, privacy, and no API costs. The toolkit integrates with the sentence-transformers library to support this.
To use a Sentence Transformers model, you must first install the necessary dependency:
pip install sentence-transformers
Once installed, you can specify a model like EmbeddingModel.ALL_MINILM_L6_V2, which is a popular, well-balanced choice.
from kerb.embedding import embed, EmbeddingModel
# This code requires 'pip install sentence-transformers'
text = "Natural language processing enables AI to understand text"
st_embedding = embed(
    text,
    model=EmbeddingModel.ALL_MINILM_L6_V2
)
print(f"Sentence Transformers model generated an embedding with {len(st_embedding)} dimensions.")
For the highest-quality embeddings, you can use API-based models from providers like OpenAI. This requires an API key and incurs costs per usage, but it often yields the best performance for semantic retrieval.
First, ensure you have the OpenAI library installed:
pip install openai
You'll also need to set your OpenAI API key as an environment variable (OPENAI_API_KEY). Then, you can select an OpenAI model such as EmbeddingModel.TEXT_EMBEDDING_3_SMALL.
from kerb.embedding import embed, EmbeddingModel
# This code requires 'pip install openai' and an API key
text = "Natural language processing enables AI to understand text"
openai_embedding = embed(
    text,
    model=EmbeddingModel.TEXT_EMBEDDING_3_SMALL
)
print(f"OpenAI model generated an embedding with {len(openai_embedding)} dimensions.")
Different models produce embeddings of different dimensions. For instance, ALL_MINILM_L6_V2 creates a 384-dimension vector, while OpenAI's TEXT_EMBEDDING_3_SMALL creates a 1536-dimension vector. While larger vectors can capture more details, they also require more storage and computational resources.
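As a quick way to see this difference, you can embed the same text with the default local model and a Sentence Transformers model and compare the resulting vector lengths. This is a minimal sketch: the second call assumes sentence-transformers is installed, and the printed dimensions depend on the models you select.

from kerb.embedding import embed, embedding_dimension, EmbeddingModel

text = "Embedding dimensionality varies by model"

# Default local, hash-based embedding
local_embedding = embed(text)

# Sentence Transformers model (requires 'pip install sentence-transformers')
st_embedding = embed(text, model=EmbeddingModel.ALL_MINILM_L6_V2)

print(f"Default local model: {embedding_dimension(local_embedding)} dimensions")
print(f"ALL_MINILM_L6_V2: {embedding_dimension(st_embedding)} dimensions")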
In a typical RAG system, you'll need to embed hundreds or thousands of document chunks. Calling embed in a loop for each chunk is inefficient, especially when using a GPU-accelerated local model or making API calls.
The embed_batch function is designed for this exact scenario. It processes a list of texts in a single, optimized call, significantly improving performance.
from kerb.embedding import embed_batch, EmbeddingModel
# A list of document chunks from the previous chapter
document_chunks = [
    "Python is a high-level programming language",
    "Machine learning models learn patterns from data",
    "Natural language processing helps computers understand text",
    "Deep neural networks have multiple layers",
    "Data science combines statistics and programming"
]
# Generate embeddings for all chunks at once
# Note: This requires 'pip install sentence-transformers'
chunk_embeddings = embed_batch(
    document_chunks,
    model=EmbeddingModel.ALL_MINILM_L6_V2
)
print(f"Generated {len(chunk_embeddings)} embeddings.")
print(f"Each embedding has {len(chunk_embeddings[0])} dimensions.")
Using embed_batch is the standard practice for preparing a knowledge base for a RAG system. It ensures that your entire corpus of text chunks is efficiently converted into a collection of vectors, ready for storage and retrieval. With these vectors in hand, you are now prepared to perform mathematical comparisons to find semantically relevant information, which we will cover in the next section.