After splitting our documents into manageable chunks in the previous step, we face a new challenge: how do we efficiently search through potentially thousands or millions of these text chunks to find the ones most relevant to a user's query? Comparing raw text strings directly is inefficient and often misses semantic connections (e.g., "dog breed information" vs. "facts about golden retrievers"). We need a way to represent the meaning of the text numerically. This is where text embedding models come into play.
Text embedding models are specialized neural networks designed to transform pieces of text (words, sentences, paragraphs, or entire documents) into numerical representations called vectors. Think of these vectors as coordinates locating the text within a high-dimensional "meaning space".
The fundamental concept behind effective text embeddings is that semantic similarity corresponds to spatial proximity. Texts that have similar meanings will be mapped to vectors that are close to each other in this vector space, while texts with different meanings will be farther apart. This numerical representation allows us to perform meaningful comparisons and searches computationally.
For example, sentences like "What are the symptoms of the flu?" and "How do I know if I have influenza?" would likely produce embedding vectors that are very close together. In contrast, a sentence like "What is the capital of France?" would result in a vector located far away from the flu-related ones.
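To make this concrete, here is a minimal sketch using the `sentence-transformers` library and the `all-MiniLM-L6-v2` model (both discussed below); the exact similarity scores are illustrative and will vary by model.

```python
from sentence_transformers import SentenceTransformer, util

# Load a small, widely used open-source embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "What are the symptoms of the flu?",
    "How do I know if I have influenza?",
    "What is the capital of France?",
]

# Each sentence becomes a fixed-length vector (384 dimensions for this model).
embeddings = model.encode(sentences)

# The two flu-related questions point in nearly the same direction...
print(util.cos_sim(embeddings[0], embeddings[1]))  # high, e.g. around 0.8
# ...while the unrelated question scores much lower.
print(util.cos_sim(embeddings[0], embeddings[2]))  # low, e.g. around 0.1
```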
A simplified 2D representation of a high-dimensional embedding space. Similar concepts like "dog" and "puppy" are mapped closely together, while unrelated concepts like "car" are distant.
These vector representations are not created arbitrarily. They are the output of sophisticated deep learning models, often based on the Transformer architecture (which also powers many LLMs like GPT and BERT). These embedding models are trained on vast amounts of text data, learning to capture the contextual nuances, semantics, and relationships between words and concepts.
Popular options for generating embeddings include:

- API-based services: Providers like OpenAI (`text-embedding-ada-002`, `text-embedding-3-small`, `text-embedding-3-large`), Cohere, and Google offer embedding generation through API calls. This simplifies integration but involves network latency and potential costs per API call.
- Open-source libraries: Libraries such as `sentence-transformers` (built on Hugging Face's `transformers`) provide access to a wide range of pre-trained models (e.g., `all-MiniLM-L6-v2`, `multi-qa-mpnet-base-dot-v1`) that you can run locally or on your own infrastructure. This offers more control and can be cost-effective for large volumes, but requires managing the model and computational resources.

The choice of embedding model can significantly impact your RAG system's performance. Factors to consider include retrieval quality on your domain, embedding dimensionality, maximum input length, cost, latency, and whether your data may be sent to a third-party service.
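For the hosted route, generating an embedding is typically a single API call. The sketch below assumes the official `openai` Python package (version 1 or later) and an `OPENAI_API_KEY` set in the environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["What are the symptoms of the flu?"],
)

vector = response.data[0].embedding  # a list of floats (1536 dimensions for this model)
print(len(vector))
```

The open-source route looks much the same in code, except the model runs on hardware you manage, as in the earlier `sentence-transformers` example.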
Once we have text represented as vectors, how do we measure "closeness" in this high-dimensional space? The most common metric is Cosine Similarity. Instead of measuring the Euclidean distance between the vector endpoints (which is sensitive to vector magnitude), cosine similarity measures the cosine of the angle between two vectors. It effectively tells us if the vectors are pointing in the same direction.
For two vectors A and B, the cosine similarity is calculated as:
$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\;\sqrt{\sum_{i=1}^{n} B_i^2}}$$

Where:

- $A_i$ and $B_i$ are the $i$-th components of vectors $A$ and $B$
- $n$ is the number of dimensions of the embedding vectors
- $\|A\|$ and $\|B\|$ are the magnitudes (Euclidean norms) of the vectors
The result ranges from -1 to 1:

- 1 means the vectors point in exactly the same direction (maximum similarity).
- 0 means the vectors are orthogonal (no measurable similarity).
- -1 means the vectors point in opposite directions (maximum dissimilarity).
In practice, for embeddings generated by many common models, the values typically fall between 0 and 1, as the models are often trained to represent semantic similarity in this range.
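The formula is straightforward to implement directly. Here is a small sketch with NumPy, using made-up 3-dimensional vectors purely for illustration:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: (A . B) / (||A|| ||B||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.2, 0.9, 0.4])
b = np.array([0.25, 0.8, 0.5])   # points in roughly the same direction as a
c = np.array([-0.9, 0.1, -0.3])  # points in a very different direction

print(cosine_similarity(a, b))  # close to 1 (similar)
print(cosine_similarity(a, c))  # negative (dissimilar)
```

In practice you would rarely write this yourself; embedding libraries and vector stores provide optimized implementations.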
Text embeddings are the linchpin of the "Retrieval" step in RAG. The process typically works like this:

1. During indexing, each document chunk is passed through the embedding model, and the resulting vectors are stored alongside the original text.
2. At query time, the user's question is converted into a vector using the same embedding model.
3. The query vector is compared against the stored chunk vectors using a similarity metric such as cosine similarity.
4. The most similar chunks (the "top k") are retrieved as candidate context.
This retrieved information is then used to "augment" the prompt sent to the LLM, providing it with the necessary context to answer the user's query based on the external data.
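Putting the retrieval steps together, here is a small end-to-end sketch. It again assumes `sentence-transformers`, and the chunks are invented examples standing in for the output of the previous splitting step:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Stand-in chunks from the document-splitting step.
chunks = [
    "Influenza symptoms include fever, cough, sore throat, and fatigue.",
    "Paris is the capital and largest city of France.",
    "Antiviral drugs work best when taken soon after flu symptoms appear.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")

# Index time: embed every chunk once. Normalizing lets us use a plain
# dot product as cosine similarity later.
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

# Query time: embed the user's question with the same model.
query_vector = model.encode("How do I know if I have the flu?", normalize_embeddings=True)

# Score every chunk against the query and keep the top-k matches.
scores = chunk_vectors @ query_vector        # cosine similarities
top_k = np.argsort(scores)[::-1][:2]         # indices of the 2 best chunks
for i in top_k:
    print(f"{scores[i]:.3f}  {chunks[i]}")
```

A vector store, introduced next, replaces this brute-force comparison with an index that scales to millions of chunks.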
Now that we understand how to represent text chunks as searchable vectors, we need an efficient way to store these vectors and perform similarity searches rapidly, especially when dealing with large datasets. This leads us to the concept of Vector Stores.