As we saw in the introduction to this chapter, converting text into a numerical format is essential for machine learning models. The simplest approach is often the Bag-of-Words (BoW) model. In BoW, we represent each document as a vector where each element corresponds to the count of a particular word from the overall vocabulary.
Imagine two simple sentences: "The cat sat on the mat." and "The dog chased the cat."
Ignoring case and punctuation for simplicity, the vocabulary is {"the", "cat", "sat", "on", "mat", "dog", "chased"}, and the BoW vectors would be:

Sentence 1: [2, 1, 1, 1, 1, 0, 0]
Sentence 2: [2, 1, 0, 0, 0, 1, 1]

(each element is the count of the corresponding vocabulary word in that sentence)
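To make this concrete, here is a minimal sketch of how these counts could be produced with scikit-learn's CountVectorizer. Note that it orders vocabulary columns alphabetically, so the columns will not match the hand-built order above exactly:

```python
from sklearn.feature_extraction.text import CountVectorizer

# The two example sentences from above
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]

# CountVectorizer lowercases and strips punctuation by default,
# then counts each vocabulary word per document.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # e.g. ['cat' 'chased' 'dog' 'mat' 'on' 'sat' 'the']
print(bow.toarray())                       # one row of counts per sentence
```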
This representation is intuitive and easy to compute. However, it has a significant limitation: it treats every word equally. Notice the word "The" has the highest count (2) in both sentences. Does "The" tell us much about the specific content or difference between these sentences? Probably not. Common words like "the", "is", "in", "a" often dominate the counts but carry less discriminative information compared to words like "sat", "mat", "dog", or "chased". BoW vectors can become heavily skewed by these frequent, yet often uninformative, words.
Furthermore, BoW only considers word counts within a single document. It doesn't account for how common or rare a word is across the entire collection of documents (the corpus). A word might appear many times in one specific document, making it seem important based on BoW alone. But if that same word also appears frequently in almost every other document in the corpus, its ability to distinguish that specific document is diminished. Conversely, a word appearing moderately frequently in one document but very rarely elsewhere might be a strong indicator of that document's unique topic.
To address these limitations, we need a way to adjust the numerical representation to reflect the importance of a word, not just its raw frequency. We want to give higher weights to words that are frequent in a particular document but relatively rare across the entire corpus. This is precisely the motivation behind the Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme.
TF-IDF aims to quantify how relevant a word is to a document within a collection of documents. It achieves this by combining two distinct metrics:

- Term Frequency (TF): how often the term appears within a single document.
- Inverse Document Frequency (IDF): how rare the term is across the entire corpus.
Let's look at each component.
Term Frequency quantifies the prominence of a term within a single document. The basic idea is straightforward: the more times a term t appears in a document d, the more likely it is that t is relevant to the topic of d.
There are several ways to calculate TF: the raw count of the term in the document, the raw count divided by the total number of terms in the document (frequency normalization), or a logarithmically scaled count such as log(1 + count). The choice of TF calculation can depend on the specific application and dataset characteristics. Frequency normalization is very common; a small sketch of it appears below.
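As an illustrative sketch (not the article's own code), the frequency-normalized variant could be computed like this:

```python
from collections import Counter

def term_frequency(tokens):
    """Frequency-normalized TF: raw count divided by the document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# Tokens of the first example sentence, lowercased with punctuation removed
print(term_frequency("the cat sat on the mat".split()))
# {'the': 0.33..., 'cat': 0.17..., 'sat': 0.17..., 'on': 0.17..., 'mat': 0.17...}
```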
While TF tells us about a term's frequency within one document, Inverse Document Frequency tells us about its rarity or commonness across the entire corpus D. The intuition is that terms appearing in many documents (like "the", "a", "is") are less likely to be discriminative or informative for distinguishing between documents. Terms that appear in only a few documents are likely more specific and thus more informative.
IDF assigns a higher value to rarer terms and a lower value to common terms. It's typically calculated as follows:
$$\mathrm{IDF}(t, D) = \log\left(\frac{N}{\mathrm{df}_t}\right)$$

where N is the total number of documents in the corpus D and df_t is the number of documents that contain the term t.
The logarithm (usually the natural log or log base 10) is used to dampen the ratio N/df_t. Without it, very rare words could completely dominate the weighting.
Note: If a term appears in every document (df_t = N), the IDF becomes log(1) = 0, effectively removing the term's contribution. If a term appears in no documents (e.g., a new word encountered during testing that wasn't in the training corpus), df_t would be 0, leading to division by zero. To handle this, variations often add 1 to the denominator or use "smoothing":
$$\mathrm{IDF}_{\mathrm{smooth}}(t, D) = \log\left(\frac{N}{1 + \mathrm{df}_t}\right) + 1$$

or

$$\mathrm{IDF}_{\mathrm{smooth}}(t, D) = \log\left(\frac{1 + N}{1 + \mathrm{df}_t}\right) + 1$$

The "+ 1" added at the end ensures that terms appearing in many documents still have a small positive weight, preventing potential issues in downstream calculations.
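A minimal sketch of both the basic and the smoothed IDF, assuming each document is represented as a set of its lowercased terms (the function name and inputs are illustrative):

```python
import math

def idf(term, documents, smooth=True):
    """IDF over a corpus given as a list of term sets."""
    n_docs = len(documents)
    df = sum(1 for doc in documents if term in doc)
    if smooth:
        # Smoothed variant: log((1 + N) / (1 + df_t)) + 1
        return math.log((1 + n_docs) / (1 + df)) + 1
    # Basic variant: log(N / df_t); fails if df_t is 0
    return math.log(n_docs / df)

docs = [
    {"the", "cat", "sat", "on", "mat"},
    {"the", "dog", "chased", "cat"},
]
print(idf("the", docs))  # appears in every document -> lower weight
print(idf("sat", docs))  # appears in only one document -> higher weight
```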
The TF-IDF score for a term t in a document d within a corpus D is simply the product of its TF score and its IDF score:
$$\mathrm{TFIDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)$$

This combined score achieves the desired outcome: a term receives a high weight when it appears frequently in a particular document but rarely across the corpus, and a low weight when it is either rare in the document or common throughout the corpus.
Consider our simple example again: "The cat sat on the mat." and "The dog chased the cat.".
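Using raw counts for TF and the unsmoothed IDF(t, D) = log(N / df_t) with the natural logarithm (one possible choice among the variants above), the corpus has N = 2 documents, so:

- "the" and "cat" appear in both documents: IDF = log(2/2) = 0, so their TF-IDF is 0 in both sentences, even though "the" has the highest raw count.
- "sat", "on", and "mat" appear only in the first sentence: IDF = log(2/1) ≈ 0.69, giving each a TF-IDF of about 0.69 there.
- "dog" and "chased" appear only in the second sentence and likewise score about 0.69 there.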
This simple example illustrates how TF-IDF automatically down-weights common words like "the" and "cat" (in this tiny corpus) and gives more prominence to the words that differentiate the documents ("sat", "mat", "dog", "chased").
In practice, instead of manually calculating these, we use libraries like Scikit-learn, which provide efficient implementations (TfidfVectorizer). By moving from simple BoW counts to TF-IDF scores, we create a more nuanced and often more effective numerical representation of text, capturing not just word presence but also a measure of word importance relative to the document and the corpus. This refined representation serves as a better foundation for subsequent steps like training machine learning models, which we will cover later. We will now explore how to calculate these scores practically and discuss other techniques like N-grams to further enhance our text features.
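As a preview, here is a minimal sketch using TfidfVectorizer on the two example sentences. Scikit-learn's defaults apply the smoothed IDF shown earlier and then L2-normalize each document vector, so the numbers will differ from the hand calculation above (in such a tiny corpus, smoothing keeps even common words at a positive weight):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]

# Learn the vocabulary and IDF weights, then transform the corpus
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(3))  # one row of TF-IDF weights per sentence
```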