As we saw, simply counting words (like in Bag-of-Words) gives equal importance to all terms, regardless of their predictive power or uniqueness. A word like "the" might appear frequently in many documents but tells us little about the specific topic of any single document. To create more informative features, we need a way to quantify how significant a word is to a particular document within a larger collection (corpus). This is precisely what Term Frequency-Inverse Document Frequency (TF-IDF) achieves. It assigns weights to terms based not just on their frequency within a document but also on their rarity across all documents.
Let's break down how this score is calculated. It involves two main components: Term Frequency (TF) and Inverse Document Frequency (IDF).
Term Frequency measures how often a term t appears within a specific document d. The intuition is simple: a word that appears multiple times in a document is likely more relevant to that document's content than a word appearing only once.
However, raw counts can be misleading. A document twice as long as another might naturally have higher counts for the same terms, even if the term's relative importance is similar. To account for document length, Term Frequency is often normalized. The most common normalization method is to divide the raw count of the term by the total number of terms in the document:
$$\mathrm{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$$

Example: Consider Document 1: "The cat sat on the mat." Total terms = 6. The term "the" appears 2 times. The term "cat" appears 1 time.
TF("the",Doc1)=62≈0.333 TF("cat",Doc1)=61≈0.167
Other variations of TF exist, such as using the raw count directly, using a boolean frequency (1 if the term exists, 0 otherwise), or applying logarithmic scaling (e.g., 1+log(raw count)) to dampen the effect of very high term counts. The normalized version shown above is widely used because it prevents longer documents from dominating the term weights solely due to their length.
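The variants mentioned above can be sketched in the same way; these helpers are illustrative rather than taken from any particular library:

```python
import math
from collections import Counter

def tf_raw(term, tokens):
    """Raw count of the term in the document."""
    return Counter(tokens)[term]

def tf_boolean(term, tokens):
    """1 if the term occurs at least once, 0 otherwise."""
    return 1 if term in tokens else 0

def tf_log(term, tokens):
    """Logarithmically scaled TF: 1 + log(raw count), or 0 if the term is absent."""
    count = Counter(tokens)[term]
    return 1 + math.log(count) if count > 0 else 0

doc1 = "the cat sat on the mat".split()
print(tf_raw("the", doc1), tf_boolean("the", doc1), round(tf_log("the", doc1), 3))  # 2 1 1.693
```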
Inverse Document Frequency measures how informative a term t is. It assesses the rarity of a term across the entire corpus D (the collection of all documents). The idea is that terms appearing in many documents (like common words or stop words) are less informative than terms appearing in only a few documents. These rarer terms help distinguish specific documents from the rest.
The standard formula for IDF is:
$$\mathrm{IDF}(t, D) = \log\left(\frac{\text{Total number of documents } |D|}{\text{Number of documents containing term } t}\right)$$

Let's examine the components: |D| is the total number of documents in the corpus, and the denominator, usually written df(t) (the document frequency of t), is the number of documents that contain the term at least once.
Notice that if a term appears in all documents, the ratio inside the log becomes |D|/|D| = 1, and log(1) = 0. This means common terms get a very low IDF score, effectively diminishing their weight. Conversely, if a term appears in very few documents, the ratio is large, resulting in a high IDF score.
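A quick sketch of the unsmoothed formula, using made-up document frequencies to show the two extremes just described:

```python
import math

def idf(df_t, n_docs):
    """Unsmoothed IDF: log of (total documents / documents containing the term)."""
    return math.log(n_docs / df_t)

print(idf(df_t=100, n_docs=100))          # term in every document -> log(1) = 0.0
print(round(idf(df_t=2, n_docs=100), 2))  # rare term -> log(50) ≈ 3.91
```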
IDF Smoothing: A potential issue arises if a term appears in every document, leading to IDF=0. More problematically, if a term from our vocabulary (perhaps encountered during testing) never appeared in the training corpus used to calculate IDF, the denominator df(t) would be 0, leading to division by zero.
To prevent these issues, a common modification, often called "IDF smoothing," is applied by adding 1 to the denominator:
$$\mathrm{IDF}_{\text{smooth}}(t, D) = \log\left(\frac{|D|}{1 + \mathrm{df}(t)}\right)$$

Many implementations, including scikit-learn's default, go a step further: they add 1 to both the numerator and the denominator (as if one extra document containing every term had been added to the corpus) and then add 1 to the result of the logarithm, ensuring all IDF values are positive and giving some weight even to terms present in all documents:
$$\mathrm{IDF}_{\text{sklearn}}(t, D) = \log\left(\frac{1 + |D|}{1 + \mathrm{df}(t)}\right) + 1$$

Using one of these smoothed versions is generally recommended in practice.
Example (using the smoothed IDF): Consider a corpus with 100 documents (|D| = 100) in which "the" appears in 95 documents, "cat" in 10, and "VAE" in 2.
IDFsmooth("the",D)=log(1+95100)=log(96100)≈log(1.04)≈0.04 IDFsmooth("cat",D)=log(1+10100)=log(11100)≈log(9.09)≈2.21 IDFsmooth("VAE",D)=log(1+2100)=log(3100)≈log(33.33)≈3.51
As expected, the common term "the" gets a very low IDF score, while the rarer term "VAE" gets a much higher score.
The final TF-IDF score for a term t in a document d within a corpus D is calculated by multiplying its Term Frequency (TF) by its Inverse Document Frequency (IDF):
$$\text{TF-IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)$$

This score reflects both the local importance of the term within the document (TF) and its global importance across the corpus (IDF).
Example (combining previous calculations): Using Document 1: "The cat sat on the mat." and our example corpus (|D| = 100, df("the") = 95, df("cat") = 10), and using the smoothed IDF:
TF("the",Doc1)≈0.333
IDFsmooth("the",D)≈0.04
TF-IDF("the",Doc1,D)≈0.333×0.04≈0.013
TF("cat",Doc1)≈0.167
IDFsmooth("cat",D)≈2.21
TF-IDF("cat",Doc1,D)≈0.167×2.21≈0.369
Even though "the" appeared more frequently in Document 1, its very low IDF score results in a much lower final TF-IDF weight compared to "cat", which is less frequent in this specific document but much rarer overall.
The TF-IDF calculation is performed for every term in the vocabulary across every document in the corpus. The result is typically represented as a matrix where rows correspond to documents and columns correspond to terms (from the vocabulary). Each cell (d,t) in this matrix contains the TF-IDF score for term t in document d.
|       | Term 1   | Term 2   | ... | Term V   |
|-------|----------|----------|-----|----------|
| Doc 1 | tfidf_11 | tfidf_12 | ... | tfidf_1V |
| Doc 2 | tfidf_21 | tfidf_22 | ... | tfidf_2V |
| ...   | ...      | ...      | ... | ...      |
| Doc N | tfidf_N1 | tfidf_N2 | ... | tfidf_NV |
Each row of this matrix is a TF-IDF vector representing the corresponding document. These vectors capture the weighted importance of terms and serve as the numerical input features for machine learning algorithms. Often, these vectors are further normalized (e.g., using L2 normalization) so that all document vectors have a unit length, which can improve the performance of algorithms sensitive to feature magnitude, like distance-based classifiers.
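In practice this matrix is rarely built by hand. Scikit-learn's TfidfVectorizer produces it directly and, by default, uses the smoothed IDF variant and L2-normalizes each document vector; the tiny corpus below is purely illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A VAE is a generative model.",
]

# Defaults: smooth_idf=True, norm='l2' (each row scaled to unit length)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)  # sparse matrix of shape (n_documents, vocabulary_size)

print(tfidf_matrix.shape)
print(vectorizer.get_feature_names_out())  # vocabulary terms, i.e. the matrix columns
```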
By calculating TF-IDF scores, we transform raw text into meaningful numerical vectors that reflect term importance both locally and globally, providing a solid foundation for many NLP tasks.