To effectively build multimodal AI models, we need to transform raw data from each modality into a format that our algorithms can understand and process. For text, this means converting strings of characters, words, and sentences into numerical representations, often called features or embeddings. Raw text itself isn't directly usable by most machine learning models, which expect numerical input. Let's look at some common techniques to extract these meaningful features from text data.
Imagine trying to teach a computer to understand a sentence like "A fluffy cat sleeps on a warm rug." The computer doesn't inherently know what "fluffy," "cat," or "sleeps" means. It only sees a sequence of characters. Feature extraction is the process of converting this text into a set of numbers (a vector) that captures some aspect of its meaning or structure. These numerical vectors can then be fed into machine learning models, which learn patterns from them. In a multimodal system, these text features will later be combined with features extracted from images, audio, or other data types.
One of the most straightforward ways to represent text numerically is the Bag-of-Words (BoW) model. Think of it like this: you take all the words in your document, throw them into a "bag," and count how many times each word appears. The order of words and the grammar are disregarded; only the presence and frequency matter.
Here's how it generally works:
1. Build a vocabulary of all the unique words that appear across your documents.
2. For each document, count how many times each vocabulary word occurs.
3. Represent the document as a vector of these counts, with one position per vocabulary word.
Let's consider a tiny example with two documents:
Document 1: "The cat sat on the mat."
Document 2: "The dog chased the cat."
Our vocabulary (ignoring case and punctuation for simplicity, and using unique words) might be: {"the", "cat", "sat", "on", "mat", "dog", "chased"}.
BoW vector for Document 1:
[2, 1, 1, 1, 1, 0, 0]
BoW vector for Document 2:
[2, 1, 0, 0, 0, 1, 1]
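To make this concrete, here is a minimal sketch in Python (not tied to any particular library) that reproduces the two vectors above by counting tokens against the fixed vocabulary. The simple lowercasing and punctuation stripping is an assumption made for this toy example; in practice, tools such as scikit-learn's CountVectorizer handle tokenization for you.

```python
from collections import Counter

# Fixed vocabulary, in the same order as above
vocabulary = ["the", "cat", "sat", "on", "mat", "dog", "chased"]

documents = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]

def bow_vector(text, vocabulary):
    # Lowercase and strip the period before splitting into tokens
    tokens = text.lower().replace(".", "").split()
    counts = Counter(tokens)
    # One count per vocabulary word, in vocabulary order
    return [counts[word] for word in vocabulary]

for doc in documents:
    print(bow_vector(doc, vocabulary))
# [2, 1, 1, 1, 1, 0, 0]
# [2, 1, 0, 0, 0, 1, 1]
```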
Strengths of BoW:
- Simple to understand and implement.
- Fast to compute, even for large collections of documents.
- Works surprisingly well as a baseline for tasks like document classification.
Weaknesses of BoW:
- Word order and grammar are lost: "the dog chased the cat" and "the cat chased the dog" produce identical vectors.
- Vectors are sparse and high-dimensional, with one position for every word in the vocabulary.
- Every word is weighted purely by its raw count, so very common but uninformative words dominate.
The Bag-of-Words model treats all words equally based on their raw counts. However, some words are inherently more informative than others. For instance, words like "the," "a," or "is" appear very frequently in almost all English texts but carry little specific meaning about the document's content. TF-IDF tries to address this by giving higher weights to words that are frequent in a particular document but rare across the entire collection of documents (the corpus).
TF-IDF is a product of two statistics: Term Frequency and Inverse Document Frequency.
Term Frequency (TF): This measures how often a term (word) t appears in a document d. There are several ways to calculate TF. A simple one is the raw count, but often it's normalized to prevent a bias towards longer documents:
$$\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$$
Inverse Document Frequency (IDF): This measures how important a term is across the entire corpus D. It scales down the weight of terms that appear in many documents and scales up the weight of terms that appear in few documents. A common way to calculate IDF is:
$$\text{IDF}(t, D) = \log\left(\frac{\text{Total number of documents } |D|}{\text{Number of documents containing term } t}\right)$$
To avoid division by zero if a term isn't in any document (which shouldn't happen if the vocabulary is built from the corpus), and to avoid a zero weight when a term appears in every document (since $\log(1) = 0$), smoothing is often applied, for instance by adding 1 to the denominator and 1 to the result: $\text{IDF}(t, D) = \log\left(\frac{|D|}{1 + \text{Number of documents containing term } t}\right) + 1$. For our introductory purposes, the basic idea is key.
The TF-IDF score for a term t in document d within corpus D is then:
$$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$$
A high TF-IDF score is achieved by a high term frequency (the term is common in the specific document) and a low document frequency of the term in the whole collection of documents (the term is rare overall). This makes TF-IDF effective at highlighting words that are characteristic of a particular document.
For example, in a collection of news articles, the word "election" might have a high TF-IDF score in an article specifically about an election, while the word "today" would likely have a low IDF score (and thus a lower TF-IDF score) because it appears in many articles.
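As a rough sketch, the basic (unsmoothed) formulas above can be implemented directly in Python. Note that practical libraries such as scikit-learn's TfidfVectorizer apply additional smoothing and normalization, so their exact values will differ from this toy calculation.

```python
import math

# Toy corpus: each document is already tokenized into lowercase words
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
]

def tf(term, doc):
    # Raw count of the term divided by the total number of terms in the document
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # log of (total documents / documents containing the term), no smoothing
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "sat" appears only in the first document, so it gets a positive weight there;
# "the" appears in every document, so its IDF (and therefore TF-IDF) is zero.
print(tf_idf("sat", corpus[0], corpus))  # ~0.12
print(tf_idf("the", corpus[0], corpus))  # 0.0
```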
Strengths of TF-IDF:
- Down-weights common, uninformative words and highlights terms that are distinctive for a particular document.
- Remains simple and cheap to compute.
- Often a noticeable improvement over raw counts for tasks such as search and document classification.
Weaknesses of TF-IDF:
- Like BoW, it ignores word order and produces sparse, high-dimensional vectors.
- Each word is still treated as an independent symbol, so relationships in meaning between different words are not captured.
While BoW and TF-IDF provide useful numerical representations, they don't capture the meaning or semantic relationships between words. For example, "happy" and "joyful" are synonyms, but in a BoW or TF-IDF model, their vectors might be completely different and unrelated if they are treated as distinct vocabulary items.
Word embeddings aim to solve this. They represent words as dense, low-dimensional vectors (e.g., 50 to 300 dimensions, compared to potentially tens of thousands for BoW/TF-IDF). The crucial aspect of these embeddings is that words with similar meanings are represented by vectors that are close to each other in this vector space.
Imagine a space where words like "king," "queen," "prince," and "princess" are located. Word embeddings learn representations such that the vector relationship between "king" and "queen" might be similar to the vector relationship between "man" and "woman." This is often illustrated by the famous example: vector("king") - vector("man") + vector("woman") ≈ vector("queen").
How are they different from BoW/TF-IDF?
- Dense rather than sparse: most entries of an embedding vector are non-zero, whereas BoW/TF-IDF vectors are mostly zeros.
- Low-dimensional: typically tens to a few hundred dimensions instead of one dimension per vocabulary word.
- Learned from context: the values are learned from how words are used in large text corpora, so words that appear in similar contexts end up with similar vectors.
Popular algorithms for creating word embeddings include:
- Word2Vec, which learns embeddings by predicting a word from its surrounding context (or the context from the word).
- GloVe, which learns embeddings from global word co-occurrence statistics.
- fastText, which extends Word2Vec with subword (character n-gram) information, helping with rare or misspelled words.
A significant advantage is the availability of pre-trained word embeddings. Researchers have trained these models on vast amounts of text data (like all of Wikipedia or large news corpora). This means you can often download these pre-trained embeddings and use them directly in your models without needing to train them from scratch on your own (potentially smaller) dataset. This is incredibly useful, especially when your dataset isn't large enough to learn high-quality embeddings on its own.
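As an illustration, the sketch below assumes the gensim library is available and downloads a small set of pre-trained GloVe vectors ("glove-wiki-gigaword-50" is one of the models offered through gensim's downloader). The first call fetches the file over the network and caches it locally.

```python
import gensim.downloader as api

# Load 50-dimensional GloVe vectors pre-trained on Wikipedia + Gigaword
# (a download of tens of megabytes, cached after the first call)
vectors = api.load("glove-wiki-gigaword-50")

# Words with similar meanings have nearby vectors
print(vectors.similarity("happy", "joyful"))       # relatively high
print(vectors.similarity("happy", "mathematics"))  # much lower

# The classic analogy: king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```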
For multimodal systems, these dense word (or sentence) embedding vectors serve as rich, semantically informed inputs to the parts of your neural network that will later combine this textual information with features from images or audio.
So far, we've mostly talked about representing individual words. But often we need to represent entire sentences or documents. How do we do that? A simple and widely used approach is to average the embedding vectors of the words in the sentence, producing one fixed-size vector for the whole text, as sketched below. More advanced methods learn dedicated sentence or document embeddings, but averaging is a reasonable baseline.
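Here is a minimal sketch of the averaging approach, reusing the same pre-trained GloVe vectors as before (again assuming gensim is installed):

```python
import numpy as np
import gensim.downloader as api

# Same pre-trained GloVe vectors as in the previous sketch
vectors = api.load("glove-wiki-gigaword-50")

def sentence_embedding(sentence, vectors):
    # Average the embeddings of all in-vocabulary words in the sentence
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return np.zeros(vectors.vector_size)
    return np.mean([vectors[w] for w in words], axis=0)

emb = sentence_embedding("A fluffy cat sleeps on a warm rug", vectors)
print(emb.shape)  # (50,) for the 50-dimensional GloVe vectors above
```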
Regardless of the technique chosen, BoW, TF-IDF, or word embeddings, the objective of text feature extraction is the same: to convert raw text into a set of numerical features. These features are the building blocks that our AI model can understand. Once we have these numerical representations for text, and similarly for other modalities like images and audio (which we'll cover next), we're one step closer to combining them in a multimodal AI system. These extracted features are the inputs that will be fed into the integration techniques and model architectures we discussed in the previous chapter.