As we saw, simply counting words (like in Bag-of-Words) gives equal importance to all terms, regardless of their predictive power or uniqueness. A word like "the" might appear frequently in many documents but tells us little about the specific topic of any single document. To create more informative features, we need a way to quantify how significant a word is to a particular document within a larger collection (corpus). This is precisely what Term Frequency-Inverse Document Frequency (TF-IDF) achieves. It assigns weights to terms based not just on their frequency within a document but also on their rarity across all documents.
Let's break down how this score is calculated. It involves two main components: Term Frequency (TF) and Inverse Document Frequency (IDF).
Term Frequency measures how often a term t appears within a specific document d. The intuition is simple: a word that appears multiple times in a document is likely more relevant to that document's content than a word appearing only once.
However, raw counts can be misleading. A document twice as long as another might naturally have higher counts for the same terms, even if the term's relative importance is similar. To account for document length, Term Frequency is often normalized. The most common normalization method is to divide the raw count of the term by the total number of terms in the document:
$$\mathrm{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$$

Example: Consider Document 1: "The cat sat on the mat." Total terms = 6. The term "the" appears 2 times. The term "cat" appears 1 time.
TF("the",Doc1)=62≈0.333 TF("cat",Doc1)=61≈0.167
Other variations of TF exist, such as using the raw count directly, using a boolean frequency (1 if the term exists, 0 otherwise), or applying logarithmic scaling (e.g., 1+log(raw count)) to dampen the effect of very high term counts. The normalized version shown above is widely used because it prevents longer documents from dominating the term weights solely due to their length.
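The variants mentioned above can be sketched in the same way; these helpers are illustrative rather than taken from any particular library:

```python
import math
from collections import Counter

def tf_raw(term, tokens):
    """Raw count of the term in the document."""
    return Counter(tokens)[term]

def tf_boolean(term, tokens):
    """1 if the term occurs at least once, 0 otherwise."""
    return 1 if term in tokens else 0

def tf_log(term, tokens):
    """Logarithmically scaled TF: 1 + log(raw count), or 0 if the term is absent."""
    count = Counter(tokens)[term]
    return 1 + math.log(count) if count > 0 else 0

doc1 = "the cat sat on the mat".split()
print(tf_raw("the", doc1), tf_boolean("the", doc1), round(tf_log("the", doc1), 3))  # 2 1 1.693
```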
Inverse Document Frequency measures how informative a term t is. It assesses the rarity of a term across the entire corpus D (the collection of all documents). The idea is that terms appearing in many documents (like common words or stop words) are less informative than terms appearing in only a few documents. These rarer terms help distinguish specific documents from the rest.
The standard formula for IDF is:
$$\mathrm{IDF}(t, D) = \log\left(\frac{\text{Total number of documents } |D|}{\text{Number of documents containing term } t}\right)$$

Let's examine the components: |D| is the total number of documents in the corpus, and the denominator, usually written df(t) (the document frequency of t), is the number of documents that contain the term at least once.
Notice that if a term appears in all documents, the ratio inside the log becomes |D|/|D| = 1, and log(1) = 0. This means common terms get a very low IDF score, effectively diminishing their weight. Conversely, if a term appears in very few documents, the ratio is large, resulting in a high IDF score.
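A quick sketch of the unsmoothed formula, using made-up document frequencies to show the two extremes just described:

```python
import math

def idf(df_t, n_docs):
    """Unsmoothed IDF: log of (total documents / documents containing the term)."""
    return math.log(n_docs / df_t)

print(idf(df_t=100, n_docs=100))          # term in every document -> log(1) = 0.0
print(round(idf(df_t=2, n_docs=100), 2))  # rare term -> log(50) ≈ 3.91
```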
IDF Smoothing: A potential issue arises if a term appears in every document, leading to IDF=0. More problematically, if a term from our vocabulary (perhaps encountered during testing) never appeared in the training corpus used to calculate IDF, the denominator df(t) would be 0, leading to division by zero.
To prevent these issues, a common modification, often called "IDF smoothing," is applied by adding 1 to the denominator:
$$\mathrm{IDF}_{\text{smooth}}(t, D) = \log\left(\frac{|D|}{1 + \mathrm{df}(t)}\right)$$

Many implementations, including scikit-learn's default, go a step further: they add 1 to both the numerator and the denominator (as if one extra document containing every term had been added to the corpus) and then add 1 to the result of the logarithm, ensuring all IDF values are positive and giving some weight even to terms present in all documents:
$$\mathrm{IDF}_{\text{sklearn}}(t, D) = \log\left(\frac{1 + |D|}{1 + \mathrm{df}(t)}\right) + 1$$

Using one of these smoothed versions is generally recommended in practice.
Example (using the smoothed IDF): Consider a corpus with 100 documents (|D| = 100) in which "the" appears in 95 documents, "cat" in 10, and "VAE" in 2.
IDFsmooth("the",D)=log(1+95100)=log(96100)≈log(1.04)≈0.04 IDFsmooth("cat",D)=log(1+10100)=log(11100)≈log(9.09)≈2.21 IDFsmooth("VAE",D)=log(1+2100)=log(3100)≈log(33.33)≈3.51
As expected, the common term "the" gets a very low IDF score, while the rarer term "VAE" gets a much higher score.
The final TF-IDF score for a term t in a document d within a corpus D is calculated by multiplying its Term Frequency (TF) by its Inverse Document Frequency (IDF):
$$\text{TF-IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)$$

This score reflects both the local importance of the term within the document (TF) and its global importance across the corpus (IDF).
Example (combining previous calculations): Using Document 1: "The cat sat on the mat." and our example corpus (|D| = 100, df("the") = 95, df("cat") = 10), and using the smoothed IDF:
TF("the",Doc1)≈0.333
IDFsmooth("the",D)≈0.04
TF-IDF("the",Doc1,D)≈0.333×0.04≈0.013
TF("cat",Doc1)≈0.167
IDFsmooth("cat",D)≈2.21
TF-IDF("cat",Doc1,D)≈0.167×2.21≈0.369
Even though "the" appeared more frequently in Document 1, its very low IDF score results in a much lower final TF-IDF weight compared to "cat", which is less frequent in this specific document but much rarer overall.
The TF-IDF calculation is performed for every term in the vocabulary across every document in the corpus. The result is typically represented as a matrix where rows correspond to documents and columns correspond to terms (from the vocabulary). Each cell (d,t) in this matrix contains the TF-IDF score for term t in document d.
|       | Term 1   | Term 2   | ... | Term V   |
|-------|----------|----------|-----|----------|
| Doc 1 | tfidf_11 | tfidf_12 | ... | tfidf_1V |
| Doc 2 | tfidf_21 | tfidf_22 | ... | tfidf_2V |
| ...   | ...      | ...      | ... | ...      |
| Doc N | tfidf_N1 | tfidf_N2 | ... | tfidf_NV |
Each row of this matrix is a TF-IDF vector representing the corresponding document. These vectors capture the weighted importance of terms and serve as the numerical input features for machine learning algorithms. Often, these vectors are further normalized (e.g., using L2 normalization) so that all document vectors have a unit length, which can improve the performance of algorithms sensitive to feature magnitude, like distance-based classifiers.
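In practice this matrix is rarely built by hand. Scikit-learn's TfidfVectorizer produces it directly and, by default, uses the smoothed IDF variant and L2-normalizes each document vector; the tiny corpus below is purely illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A VAE is a generative model.",
]

# Defaults: smooth_idf=True, norm='l2' (each row scaled to unit length)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)  # sparse matrix of shape (n_documents, vocabulary_size)

print(tfidf_matrix.shape)
print(vectorizer.get_feature_names_out())  # vocabulary terms, i.e. the matrix columns
```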
By calculating TF-IDF scores, we transform raw text into meaningful numerical vectors that reflect term importance both locally and globally, providing a solid foundation for many NLP tasks.