Selecting the right embedding model is a foundational decision in building any retrieval system. It's a choice that directly influences your application's retrieval quality, operational cost, speed, and infrastructure requirements. There is no single "best" model; instead, the optimal choice depends on a series of trade-offs tailored to your specific use case. The toolkit provides a unified interface to several types of models, which can be grouped into three main categories.
The models available through the embedding module fall into three distinct tiers, each serving a different purpose.
Local Hash-Based Models: These models, accessed via EmbeddingModel.LOCAL, do not rely on machine learning. They generate deterministic vectors by hashing the input text. They require no external dependencies or downloads and are extremely fast, making them ideal for testing and prototyping where true semantic understanding is not necessary.
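As a quick illustration of that determinism (assuming, as in the examples below, that embed returns a plain Python list of floats), calling embed twice on the same text yields identical vectors:

from kerb.embedding import embed, EmbeddingModel

# The hash-based model maps identical text to identical vectors,
# which makes exact assertions in unit tests possible.
v1 = embed("cache this sentence", model=EmbeddingModel.LOCAL)
v2 = embed("cache this sentence", model=EmbeddingModel.LOCAL)
assert list(v1) == list(v2)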
Local Machine Learning Models (Sentence Transformers): These are open-source models like all-MiniLM-L6-v2 that you download and run on your own infrastructure. They offer a strong balance of quality and cost (since they are free to use, minus the compute cost). They are an excellent choice for applications requiring data privacy, offline capabilities, or predictable performance without API-related network latency.
Cloud API Models (OpenAI): These are state-of-the-art proprietary models, such as text-embedding-3-large, accessed via an API. They generally provide the highest quality embeddings available but come with usage-based pricing and a dependency on an internet connection. They are suited for production applications where top-tier performance is a priority.
Your decision should be guided by four primary factors: quality, cost, speed, and infrastructure.
Embedding quality refers to how well the model's output vectors capture the semantic meaning of the text. High-quality embeddings place similar concepts close together in vector space, which directly translates to more relevant search results. Public benchmarks like the Massive Text Embedding Benchmark (MTEB) leaderboard are a good resource for comparing the performance of different models on various tasks. Generally, larger and more modern models, like OpenAI's text-embedding-3-large, rank highest in quality.
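A quick way to see this in practice is to compare similarity scores yourself. The cosine_similarity helper below is plain Python written for this example, not part of the toolkit; a semantically related pair of texts should score noticeably higher than an unrelated one:

import math
from kerb.embedding import embed, EmbeddingModel

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

model = EmbeddingModel.ALL_MINILM_L6_V2  # any semantic model illustrates the point
query = embed("How do I reset my password?", model=model)
related = embed("Steps to recover a forgotten account password", model=model)
unrelated = embed("The weather in Paris is mild in spring", model=model)

print(f"Related pair:   {cosine_similarity(query, related):.3f}")
print(f"Unrelated pair: {cosine_similarity(query, unrelated):.3f}")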
Cost varies significantly between model types. Hash-based and Sentence Transformer models are free to use, so their cost is limited to the compute you run them on, while cloud API models charge per token processed, meaning cost scales with the volume of text you embed.
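A back-of-the-envelope estimate makes the difference concrete. The per-million-token prices below are illustrative placeholders only; check your provider's current pricing before budgeting:

# Illustrative prices per million tokens -- placeholders, not quoted rates
PRICE_PER_1M_TOKENS = {
    "local hash-based": 0.00,
    "all-MiniLM-L6-v2": 0.00,          # free to use; you pay only for compute
    "text-embedding-3-small": 0.02,    # example figure
    "text-embedding-3-large": 0.13,    # example figure
}

corpus_tokens = 5_000_000  # e.g., 10,000 documents averaging 500 tokens each
for name, price in PRICE_PER_1M_TOKENS.items():
    cost = corpus_tokens / 1_000_000 * price
    print(f"{name}: ${cost:.2f} to embed the corpus once")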
Latency is the time it takes to generate an embedding. For real-time applications, such as a chatbot that performs a RAG lookup on the fly, low latency is essential.
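Measuring latency on your own hardware and network is straightforward; a rough sketch using Python's timer:

import time
from kerb.embedding import embed, EmbeddingModel

# Time a single embedding call; for API models the network round-trip usually
# dominates, and the first local call may also include model loading time.
start = time.perf_counter()
embed("Where is my order?", model=EmbeddingModel.ALL_MINILM_L6_V2)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Single-text embedding latency: {elapsed_ms:.1f} ms")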
Batching requests with embed_batch is an important technique for improving throughput with API-based models, since embedding many texts in a single call amortizes the per-request network overhead.
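A minimal sketch of batched embedding, assuming embed_batch takes a list of texts and the same model argument as embed:

from kerb.embedding import embed_batch, EmbeddingModel

documents = [
    "First document about vector search.",
    "Second document about retrieval-augmented generation.",
    "Third document about chunking strategies.",
]

# One batched call amortizes per-request network overhead on API-based models
vectors = embed_batch(documents, model=EmbeddingModel.TEXT_EMBEDDING_3_SMALL)
print(f"Embedded {len(vectors)} documents in a single call.")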
Each embedding model produces vectors of a fixed size, or dimension. For example:

all-MiniLM-L6-v2: 384 dimensions
text-embedding-3-small: 1536 dimensions
text-embedding-3-large: 3072 dimensions

Higher dimensionality can capture more detailed information, often correlating with higher quality. However, it comes with trade-offs: larger vectors require more storage space and can increase the computational cost and time of similarity searches in a vector database.
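The storage impact is easy to estimate; the sketch below assumes vectors are stored as 4-byte float32 values, which is typical for vector databases:

# Rough storage estimate assuming float32 (4 bytes) per dimension
DIMENSIONS = {
    "all-MiniLM-L6-v2": 384,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

num_vectors = 1_000_000  # e.g., one million chunks in your vector database
for name, dims in DIMENSIONS.items():
    gigabytes = dims * 4 * num_vectors / 1e9
    print(f"{name}: ~{gigabytes:.1f} GB of raw vector storage")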
The following chart visualizes the relationship between embedding quality and cost for some popular models.
This chart plots relative quality scores from public benchmarks against the cost to process one million tokens. Local models have a cost of zero.
Here’s how to choose and use a model based on your application's needs.
When you are building unit tests or developing the initial structure of your application, semantic accuracy is often less important than speed and simplicity. The default hash-based local model is perfect for this. It requires no setup and is deterministic.
from kerb.embedding import embed, EmbeddingModel
# No dependencies or API keys needed
test_vector = embed(
    "This is for a unit test.",
    model=EmbeddingModel.LOCAL
)
print(f"Generated a {len(test_vector)}-dimensional vector for testing.")
If you are building an application where budget is a primary concern, or if your data cannot be sent to a third-party API for privacy reasons, Sentence Transformers are the best choice. They run entirely on your own infrastructure. As a starting point, all-MiniLM-L6-v2 offers a good balance of speed and quality.
To use Sentence Transformer models, you first need to install the necessary libraries:
pip install sentence-transformers
from kerb.embedding import embed, EmbeddingModel
# This will download the model on first use and run it locally
minilm_vector = embed(
    "A document for a cost-effective RAG system.",
    model=EmbeddingModel.ALL_MINILM_L6_V2
)
print(f"MiniLM vector dimensions: {len(minilm_vector)}")
When achieving the highest possible retrieval quality is the main goal, cloud-based API models are the recommended option. OpenAI's models are integrated directly and provide state-of-the-art performance. The text-embedding-3-small model is a great, cost-effective starting point, while text-embedding-3-large offers the best quality for more demanding tasks.
To use OpenAI models, you need the openai library and an API key.
pip install openai
import os
from kerb.embedding import embed, EmbeddingModel
# Assumes OPENAI_API_KEY is set as an environment variable
# or you can pass it directly with the api_key parameter
openai_vector = embed(
    "A query for a high-performance production RAG system.",
    model=EmbeddingModel.TEXT_EMBEDDING_3_SMALL,
    api_key=os.getenv("OPENAI_API_KEY")
)
print(f"OpenAI vector dimensions: {len(openai_vector)}")
To help you decide, follow this simple decision process.
A decision flowchart to guide your choice of embedding model based on project requirements.
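The same logic can be written down as a small helper function. This is only an illustrative sketch of the guidance in this section, not part of the toolkit:

from kerb.embedding import EmbeddingModel

def choose_embedding_model(testing_only=False,
                           data_must_stay_local=False,
                           quality_is_top_priority=False):
    """Mirror the decision process described above (illustrative only)."""
    if testing_only:
        return EmbeddingModel.LOCAL                # fast, deterministic, no setup
    if data_must_stay_local or not quality_is_top_priority:
        return EmbeddingModel.ALL_MINILM_L6_V2     # private, free, solid quality
    return EmbeddingModel.TEXT_EMBEDDING_3_SMALL   # strong quality at modest cost

print(choose_embedding_model(quality_is_top_priority=True))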
Ultimately, the best way to choose a model is to experiment. Use this guidance as a starting point, but consider running your own evaluations on a small sample of your actual data to see which model provides the best results for your specific domain and query patterns.
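One lightweight way to run such an evaluation is to embed a few real queries and candidate documents with each model and check whether the expected document ranks first. The snippet below is a sketch under those assumptions; the eval_pairs data and the cosine_similarity helper are placeholders you would replace with your own:

import math
from kerb.embedding import embed, embed_batch, EmbeddingModel

def cosine_similarity(a, b):
    # Plain-Python cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Replace with real query -> expected-document pairs from your own data
eval_pairs = [
    ("How do I reset my password?", "Account recovery and password reset steps"),
    ("What is the refund policy?", "Refunds are issued within 30 days of purchase"),
]
documents = [doc for _, doc in eval_pairs]

for model in (EmbeddingModel.ALL_MINILM_L6_V2, EmbeddingModel.TEXT_EMBEDDING_3_SMALL):
    doc_vectors = embed_batch(documents, model=model)
    hits = 0
    for query, expected in eval_pairs:
        query_vec = embed(query, model=model)
        scores = [cosine_similarity(query_vec, dv) for dv in doc_vectors]
        hits += documents[scores.index(max(scores))] == expected
    print(f"{model}: {hits}/{len(eval_pairs)} queries retrieved the expected document")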