To process large volumes of information, documents are often segmented into manageable units. These units, however, are initially just raw text. For a computer to find relevant information, it needs a way to understand that a user's query for "ways to make a computer think" is related to a text unit describing "artificial intelligence". Simple keyword matching would fail this test. This is where text embeddings come in.
Text embeddings are numerical representations of text, typically vectors: lists of floating-point numbers. What makes these vectors valuable is that they capture the semantic meaning and context of the original text. An embedding model, a specialized type of neural network, processes a piece of text and outputs a dense vector of a fixed size.
Think of it like giving every word or sentence a unique coordinate in a high-dimensional "meaning space". In this space, texts with similar meanings are located close to each other, while dissimilar texts are far apart. For example, the vectors for "car", "automobile", and "vehicle" would be clustered together, whereas the vector for "banana" would be in a completely different region of the space.
With the Kerb toolkit, generating an embedding is a straightforward function call. The embed() function takes a string of text and returns its vector representation.
from kerb.embedding import embed, embedding_dimension

text = "Machine learning transforms data into insights"

# Convert the text into a dense, fixed-size vector.
embedding_vector = embed(text)

print(f"Text: '{text}'")
print(f"Embedding dimension: {embedding_dimension(embedding_vector)}")
print(f"First 5 values: {[round(v, 4) for v in embedding_vector[:5]]}")
Running this code shows that the text is converted into a vector of a specific size. In this case, the default model produces a 384-dimensional vector. This means every piece of text we process with this model will be represented as a point in a 384-dimensional space.
Text: 'Machine learning transforms data into insights'
Embedding dimension: 384
First 5 values: [0.0347, -0.0121, 0.0563, -0.0205, 0.0488]
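On its own, this vector is just a list of numbers; the "meaning space" intuition becomes visible when we compare vectors. As a minimal sketch, the following uses only the embed() function shown above, with a plain dot product as a rough closeness score (this works because Kerb's vectors are normalized, a point we return to later in this section; proper similarity metrics are the subject of the next section):

from kerb.embedding import embed

car = embed("car")
automobile = embed("automobile")
banana = embed("banana")

def dot(a, b):
    # Dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

# Related terms should score noticeably higher than unrelated ones.
print(f"car vs automobile: {dot(car, automobile):.3f}")
print(f"car vs banana:     {dot(car, banana):.3f}")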
Different embedding models produce vectors of different sizes, and the choice of model can affect the quality of your semantic search. For instance, OpenAI's text-embedding-3-small model produces a 1536-dimensional vector, while Sentence Transformer models like all-MiniLM-L6-v2 produce 384-dimensional vectors.
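If you want to check a model's output size for yourself, the sentence-transformers library (separate from Kerb, and assumed to be installed here) makes this a short experiment:

from sentence_transformers import SentenceTransformer

# Load the model mentioned above and embed a sample sentence.
model = SentenceTransformer("all-MiniLM-L6-v2")
vector = model.encode("Machine learning transforms data into insights")
print(len(vector))  # 384 dimensions for this model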
Visualizing a 384-dimensional space is impossible, but we can project the idea down to two dimensions to understand the relationships. In this space, words and phrases are organized based on their learned associations from large amounts of text data.
Groups of related terms cluster together in the vector space, allowing for mathematical comparisons of their meanings.
In this simplified space, terms related to royalty are close together, while terms for fruits are in a separate cluster. This spatial arrangement is what allows us to perform semantic search. When a user queries "royal family", its vector will land in the "Royalty" cluster, and we can mathematically find its nearest neighbors, like "King" and "Queen".
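As a sketch of that lookup, again assuming only the embed() function and relying on the vectors being normalized, we can embed a small vocabulary and rank it by dot product with the query:

from kerb.embedding import embed

terms = ["King", "Queen", "Prince", "Apple", "Banana"]
vectors = {term: embed(term) for term in terms}
query = embed("royal family")

def dot(a, b):
    # Dot product; with normalized vectors, higher means closer.
    return sum(x * y for x, y in zip(a, b))

# Rank the vocabulary by closeness to the query.
ranked = sorted(terms, key=lambda t: dot(query, vectors[t]), reverse=True)
print(ranked)  # royalty terms should come first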
An embedding vector has two primary properties: its magnitude (or length) and its direction. The direction points to a specific location in the meaning space, while the magnitude can sometimes indicate importance or intensity, though this varies by model.
For similarity comparisons, we are most often interested in the angle between two vectors, not their magnitudes. If two vectors point in the same direction, the model treats their texts as maximally similar, regardless of the vectors' lengths. To make this comparison easier and more reliable, embeddings are typically normalized. Normalization scales a vector so that its magnitude becomes 1, turning it into a "unit vector". This ensures that when we later calculate similarity, we are comparing only the directions of the vectors.
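The arithmetic is simple: divide every component of the vector by its magnitude. A toy two-dimensional example makes this concrete:

import math

v = [3.0, 4.0]

# Euclidean magnitude: sqrt(3^2 + 4^2) = 5.0
magnitude = math.sqrt(sum(x * x for x in v))

# Dividing each component by the magnitude yields a unit vector.
unit = [x / magnitude for x in v]

print(unit)                                 # [0.6, 0.8]
print(math.sqrt(sum(x * x for x in unit)))  # 1.0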
You can inspect a vector's properties using the helper functions available in the embedding module.
from kerb.embedding import embed, vector_magnitude, normalize_vector

vec = embed("Embeddings convert text to numerical vectors")
print(f"Original vector magnitude: {vector_magnitude(vec):.6f}")

# Rescale the vector to unit length.
normalized_vec = normalize_vector(vec)
print(f"Normalized vector magnitude: {vector_magnitude(normalized_vec):.6f}")
The output shows a magnitude of 1.0 both before and after normalization. This is not a mistake: as noted below, embed() already returns unit vectors, so normalize_vector() leaves them unchanged.
Original vector magnitude: 1.000000
Normalized vector magnitude: 1.000000
Kerb's embed() and embed_batch() functions return normalized vectors by default, so you typically do not need to perform this step manually. However, understanding normalization is important for grasping how similarity metrics, which we will cover next, operate on these vectors.
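For completeness, batch embedding might look like the following sketch. This assumes embed_batch() accepts a list of strings and returns one normalized vector per input; check the module's documentation for the exact signature.

from kerb.embedding import embed_batch

chunks = [
    "Embeddings map text to vectors.",
    "Similar texts get nearby vectors.",
    "Bananas are a fruit.",
]

# Assumption: one vector per input string, normalized by default.
vectors = embed_batch(chunks)
print(len(vectors), len(vectors[0]))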
With our text chunks now converted into meaningful numerical vectors, we are ready to implement the core logic of a retrieval system: measuring the similarity between a query and our documents.