Selecting the right embedding model is a foundational decision in building any retrieval system. It's a choice that directly influences your application's retrieval quality, operational cost, speed, and infrastructure requirements. There is no single "best" model; instead, the optimal choice depends on a series of trade-offs tailored to your specific use case. The toolkit provides a unified interface to several types of models, which can be grouped into three main categories.
The models available through the embedding module fall into three distinct tiers, each serving a different purpose.
Local Hash-Based Models: These models, accessed via EmbeddingModel.LOCAL, do not rely on machine learning. They generate deterministic vectors by hashing the input text. They require no external dependencies or downloads and are extremely fast, making them ideal for testing and prototyping where true semantic understanding is not necessary.
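As a quick illustration of that determinism (assuming, as in the examples below, that embed returns a plain Python list of floats), calling embed twice on the same text yields identical vectors:

from kerb.embedding import embed, EmbeddingModel

# The hash-based model maps identical text to identical vectors,
# which makes exact assertions in unit tests possible.
v1 = embed("cache this sentence", model=EmbeddingModel.LOCAL)
v2 = embed("cache this sentence", model=EmbeddingModel.LOCAL)
assert list(v1) == list(v2)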
Local Machine Learning Models (Sentence Transformers): These are open-source models like all-MiniLM-L6-v2 that you download and run on your own infrastructure. They offer a strong balance of quality and cost (since they are free to use, minus the compute cost). They are an excellent choice for applications requiring data privacy, offline capabilities, or predictable performance without API-related network latency.
Cloud API Models (OpenAI): These are state-of-the-art proprietary models, such as text-embedding-3-large, accessed via an API. They generally provide the highest quality embeddings available but come with usage-based pricing and a dependency on an internet connection. They are suited for production applications where top-tier performance is a priority.
Your decision should be guided by four primary factors: quality, cost, speed, and infrastructure.
Embedding quality refers to how well the model's output vectors capture the semantic meaning of the text. High-quality embeddings place similar concepts close together in vector space, which directly translates to more relevant search results. Public benchmarks like the Massive Text Embedding Benchmark (MTEB) leaderboard are a good resource for comparing the performance of different models on various tasks. Generally, larger and more modern models, like OpenAI's text-embedding-3-large, rank highest in quality.
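A quick way to see this in practice is to compare similarity scores yourself. The cosine_similarity helper below is plain Python written for this example, not part of the toolkit; a semantically related pair of texts should score noticeably higher than an unrelated one:

import math
from kerb.embedding import embed, EmbeddingModel

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

model = EmbeddingModel.ALL_MINILM_L6_V2  # any semantic model illustrates the point
query = embed("How do I reset my password?", model=model)
related = embed("Steps to recover a forgotten account password", model=model)
unrelated = embed("The weather in Paris is mild in spring", model=model)

print(f"Related pair:   {cosine_similarity(query, related):.3f}")
print(f"Unrelated pair: {cosine_similarity(query, unrelated):.3f}")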
Cost varies significantly between model types. Hash-based and Sentence Transformer models are free to use, so their cost is limited to the compute you run them on, while cloud API models charge per token processed, meaning cost scales with the volume of text you embed.
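A back-of-the-envelope estimate makes the difference concrete. The per-million-token prices below are illustrative placeholders only; check your provider's current pricing before budgeting:

# Illustrative prices per million tokens -- placeholders, not quoted rates
PRICE_PER_1M_TOKENS = {
    "local hash-based": 0.00,
    "all-MiniLM-L6-v2": 0.00,          # free to use; you pay only for compute
    "text-embedding-3-small": 0.02,    # example figure
    "text-embedding-3-large": 0.13,    # example figure
}

corpus_tokens = 5_000_000  # e.g., 10,000 documents averaging 500 tokens each
for name, price in PRICE_PER_1M_TOKENS.items():
    cost = corpus_tokens / 1_000_000 * price
    print(f"{name}: ${cost:.2f} to embed the corpus once")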
Latency is the time it takes to generate an embedding. For real-time applications, such as a chatbot that performs a RAG lookup on the fly, low latency is essential.
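Measuring latency on your own hardware and network is straightforward; a rough sketch using Python's timer:

import time
from kerb.embedding import embed, EmbeddingModel

# Time a single embedding call; for API models the network round-trip usually
# dominates, and the first local call may also include model loading time.
start = time.perf_counter()
embed("Where is my order?", model=EmbeddingModel.ALL_MINILM_L6_V2)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Single-text embedding latency: {elapsed_ms:.1f} ms")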
Batching requests with embed_batch is an important technique for improving throughput with API-based models, since embedding many texts in a single call amortizes the per-request network overhead.
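A minimal sketch of batched embedding, assuming embed_batch takes a list of texts and the same model argument as embed:

from kerb.embedding import embed_batch, EmbeddingModel

documents = [
    "First document about vector search.",
    "Second document about retrieval-augmented generation.",
    "Third document about chunking strategies.",
]

# One batched call amortizes per-request network overhead on API-based models
vectors = embed_batch(documents, model=EmbeddingModel.TEXT_EMBEDDING_3_SMALL)
print(f"Embedded {len(vectors)} documents in a single call.")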
Each embedding model produces vectors of a fixed size, or dimension. For example:

all-MiniLM-L6-v2: 384 dimensions
text-embedding-3-small: 1536 dimensions
text-embedding-3-large: 3072 dimensions

Higher dimensionality can capture more detailed information, often correlating with higher quality. However, it comes with trade-offs: larger vectors require more storage space and can increase the computational cost and time of similarity searches in a vector database.
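The storage impact is easy to estimate; the sketch below assumes vectors are stored as 4-byte float32 values, which is typical for vector databases:

# Rough storage estimate assuming float32 (4 bytes) per dimension
DIMENSIONS = {
    "all-MiniLM-L6-v2": 384,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

num_vectors = 1_000_000  # e.g., one million chunks in your vector database
for name, dims in DIMENSIONS.items():
    gigabytes = dims * 4 * num_vectors / 1e9
    print(f"{name}: ~{gigabytes:.1f} GB of raw vector storage")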
The following chart visualizes the relationship between embedding quality and cost for some popular models.
This chart plots relative quality scores from public benchmarks against the cost to process one million tokens. Local models have a cost of zero.
Here’s how to choose and use a model based on your application's needs.
When you are building unit tests or developing the initial structure of your application, semantic accuracy is often less important than speed and simplicity. The default hash-based local model is perfect for this. It requires no setup and is deterministic.
from kerb.embedding import embed, EmbeddingModel
# No dependencies or API keys needed
test_vector = embed(
    "This is for a unit test.",
    model=EmbeddingModel.LOCAL
)
print(f"Generated a {len(test_vector)}-dimensional vector for testing.")
If you are building an application where budget is a primary concern, or if your data cannot be sent to a third-party API for privacy reasons, Sentence Transformers are the best choice. They run entirely on your own infrastructure. As a starting point, all-MiniLM-L6-v2 offers a good balance of speed and quality.
To use Sentence Transformer models, you first need to install the necessary libraries:
pip install sentence-transformers
from kerb.embedding import embed, EmbeddingModel
# This will download the model on first use and run it locally
minilm_vector = embed(
    "A document for a cost-effective RAG system.",
    model=EmbeddingModel.ALL_MINILM_L6_V2
)
print(f"MiniLM vector dimensions: {len(minilm_vector)}")
When achieving the highest possible retrieval quality is the main goal, cloud-based API models are the recommended option. OpenAI's models are integrated directly and provide state-of-the-art performance. The text-embedding-3-small model is a great, cost-effective starting point, while text-embedding-3-large offers the best quality for more demanding tasks.
To use OpenAI models, you need the openai library and an API key.
pip install openai
import os
from kerb.embedding import embed, EmbeddingModel
# Assumes OPENAI_API_KEY is set as an environment variable
# or you can pass it directly with the api_key parameter
openai_vector = embed(
    "A query for a high-performance production RAG system.",
    model=EmbeddingModel.TEXT_EMBEDDING_3_SMALL,
    api_key=os.getenv("OPENAI_API_KEY")
)
print(f"OpenAI vector dimensions: {len(openai_vector)}")
To help you decide, follow this simple decision process.
A decision flowchart to guide your choice of embedding model based on project requirements.
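The same logic can be written down as a small helper function. This is only an illustrative sketch of the guidance in this section, not part of the toolkit:

from kerb.embedding import EmbeddingModel

def choose_embedding_model(testing_only=False,
                           data_must_stay_local=False,
                           quality_is_top_priority=False):
    """Mirror the decision process described above (illustrative only)."""
    if testing_only:
        return EmbeddingModel.LOCAL                # fast, deterministic, no setup
    if data_must_stay_local or not quality_is_top_priority:
        return EmbeddingModel.ALL_MINILM_L6_V2     # private, free, solid quality
    return EmbeddingModel.TEXT_EMBEDDING_3_SMALL   # strong quality at modest cost

print(choose_embedding_model(quality_is_top_priority=True))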
Ultimately, the best way to choose a model is to experiment. Use this guidance as a starting point, but consider running your own evaluations on a small sample of your actual data to see which model provides the best results for your specific domain and query patterns.
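One lightweight way to run such an evaluation is to embed a few real queries and candidate documents with each model and check whether the expected document ranks first. The snippet below is a sketch under those assumptions; the eval_pairs data and the cosine_similarity helper are placeholders you would replace with your own:

import math
from kerb.embedding import embed, embed_batch, EmbeddingModel

def cosine_similarity(a, b):
    # Plain-Python cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Replace with real query -> expected-document pairs from your own data
eval_pairs = [
    ("How do I reset my password?", "Account recovery and password reset steps"),
    ("What is the refund policy?", "Refunds are issued within 30 days of purchase"),
]
documents = [doc for _, doc in eval_pairs]

for model in (EmbeddingModel.ALL_MINILM_L6_V2, EmbeddingModel.TEXT_EMBEDDING_3_SMALL):
    doc_vectors = embed_batch(documents, model=model)
    hits = 0
    for query, expected in eval_pairs:
        query_vec = embed(query, model=model)
        scores = [cosine_similarity(query_vec, dv) for dv in doc_vectors]
        hits += documents[scores.index(max(scores))] == expected
    print(f"{model}: {hits}/{len(eval_pairs)} queries retrieved the expected document")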