To process large volumes of information, documents are often segmented into manageable units. These units, however, are initially just raw text. For a computer to find relevant information, it needs a way to understand that a user's query for "ways to make a computer think" is related to a text unit describing "artificial intelligence". Simple keyword matching would fail this test. This is where text embeddings come in.
Text embeddings are numerical representations of text, typically vectors: lists of floating-point numbers. What makes these vectors valuable is that they capture the semantic meaning and context of the original text. An embedding model, a specialized type of neural network, processes a piece of text and outputs a dense vector of a fixed size.
Think of it like giving every word or sentence a unique coordinate in a high-dimensional "meaning space". In this space, texts with similar meanings are located close to each other, while dissimilar texts are far apart. For example, the vectors for "car", "automobile", and "vehicle" would be clustered together, whereas the vector for "banana" would be in a completely different region of the space.
With the Kerb toolkit, generating an embedding is a straightforward function call. The embed() function takes a string of text and returns its vector representation.
from kerb.embedding import embed, embedding_dimension

text = "Machine learning transforms data into insights"

# Convert the text into a dense, fixed-size vector.
embedding_vector = embed(text)

print(f"Text: '{text}'")
print(f"Embedding dimension: {embedding_dimension(embedding_vector)}")
print(f"First 5 values: {[round(v, 4) for v in embedding_vector[:5]]}")
Running this code shows that the text is converted into a vector of a specific size. In this case, the default model produces a 384-dimensional vector. This means every piece of text we process with this model will be represented as a point in a 384-dimensional space.
Text: 'Machine learning transforms data into insights'
Embedding dimension: 384
First 5 values: [0.0347, -0.0121, 0.0563, -0.0205, 0.0488]
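On its own, this vector is just a list of numbers; the "meaning space" intuition becomes visible when we compare vectors. As a minimal sketch, the following uses only the embed() function shown above, with a plain dot product as a rough closeness score (this works because Kerb's vectors are normalized, a point we return to later in this section; proper similarity metrics are the subject of the next section):

from kerb.embedding import embed

car = embed("car")
automobile = embed("automobile")
banana = embed("banana")

def dot(a, b):
    # Dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

# Related terms should score noticeably higher than unrelated ones.
print(f"car vs automobile: {dot(car, automobile):.3f}")
print(f"car vs banana:     {dot(car, banana):.3f}")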
Different embedding models produce vectors of different sizes, and the choice of model can affect the quality of your semantic search. For instance, OpenAI's text-embedding-3-small model produces a 1536-dimensional vector, while Sentence Transformer models like all-MiniLM-L6-v2 produce 384-dimensional vectors.
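If you want to check a model's output size for yourself, the sentence-transformers library (separate from Kerb, and assumed to be installed here) makes this a short experiment:

from sentence_transformers import SentenceTransformer

# Load the model mentioned above and embed a sample sentence.
model = SentenceTransformer("all-MiniLM-L6-v2")
vector = model.encode("Machine learning transforms data into insights")
print(len(vector))  # 384 dimensions for this model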
Visualizing a 384-dimensional space is impossible, but we can project the idea down to two dimensions to understand the relationships. In this space, words and phrases are organized based on their learned associations from large amounts of text data.
Groups of related terms cluster together in the vector space, allowing for mathematical comparisons of their meanings.
In this simplified space, terms related to royalty are close together, while terms for fruits are in a separate cluster. This spatial arrangement is what allows us to perform semantic search. When a user queries "royal family", its vector will land in the "Royalty" cluster, and we can mathematically find its nearest neighbors, like "King" and "Queen".
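As a sketch of that lookup, again assuming only the embed() function and relying on the vectors being normalized, we can embed a small vocabulary and rank it by dot product with the query:

from kerb.embedding import embed

terms = ["King", "Queen", "Prince", "Apple", "Banana"]
vectors = {term: embed(term) for term in terms}
query = embed("royal family")

def dot(a, b):
    # Dot product; with normalized vectors, higher means closer.
    return sum(x * y for x, y in zip(a, b))

# Rank the vocabulary by closeness to the query.
ranked = sorted(terms, key=lambda t: dot(query, vectors[t]), reverse=True)
print(ranked)  # royalty terms should come first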
An embedding vector has two primary properties: its magnitude (or length) and its direction. The direction points to a specific location in the meaning space, while the magnitude can sometimes indicate importance or intensity, though this varies by model.
For similarity comparisons, we are most often interested in the angle between two vectors, not their magnitudes. If two vectors point in the same direction, the model treats their texts as maximally similar, regardless of the vectors' lengths. To make this comparison easier and more reliable, embeddings are typically normalized. Normalization scales a vector so that its magnitude becomes 1, turning it into a "unit vector". This ensures that when we later calculate similarity, we are comparing only the directions of the vectors.
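The arithmetic is simple: divide every component of the vector by its magnitude. A toy two-dimensional example makes this concrete:

import math

v = [3.0, 4.0]

# Euclidean magnitude: sqrt(3^2 + 4^2) = 5.0
magnitude = math.sqrt(sum(x * x for x in v))

# Dividing each component by the magnitude yields a unit vector.
unit = [x / magnitude for x in v]

print(unit)                                 # [0.6, 0.8]
print(math.sqrt(sum(x * x for x in unit)))  # 1.0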
You can inspect a vector's properties using the helper functions available in the embedding module.
from kerb.embedding import embed, vector_magnitude, normalize_vector

vec = embed("Embeddings convert text to numerical vectors")
print(f"Original vector magnitude: {vector_magnitude(vec):.6f}")

# Rescale the vector to unit length.
normalized_vec = normalize_vector(vec)
print(f"Normalized vector magnitude: {vector_magnitude(normalized_vec):.6f}")
The output shows a magnitude of 1.0 both before and after normalization. This is not a mistake: as noted below, embed() already returns unit vectors, so normalize_vector() leaves them unchanged.
Original vector magnitude: 1.000000
Normalized vector magnitude: 1.000000
Kerb's embed() and embed_batch() functions return normalized vectors by default, so you typically do not need to perform this step manually. However, understanding normalization is important for grasping how similarity metrics, which we will cover next, operate on these vectors.
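For completeness, batch embedding might look like the following sketch. This assumes embed_batch() accepts a list of strings and returns one normalized vector per input; check the module's documentation for the exact signature.

from kerb.embedding import embed_batch

chunks = [
    "Embeddings map text to vectors.",
    "Similar texts get nearby vectors.",
    "Bananas are a fruit.",
]

# Assumption: one vector per input string, normalized by default.
vectors = embed_batch(chunks)
print(len(vectors), len(vectors[0]))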
With our text chunks now converted into meaningful numerical vectors, we are ready to implement the core logic of a retrieval system: measuring the similarity between a query and our documents.