While structured data like genres or actors are straightforward to use, much of the rich information about an item, such as a movie's plot summary or a product's description, is unstructured text. To build a content-based filter, we must first convert this text into a numerical format, a process called vectorization. A machine learning model cannot directly understand words like "adventure" or "galaxy," but it can operate on vectors of numbers.
A simple approach is to count the occurrences of each word, often called the Bag-of-Words (BoW) model. In this model, each document becomes a vector where each element corresponds to a word count. However, this method has a significant drawback: very common words like "the," "a," and "in" will dominate the vector, despite carrying little descriptive meaning. As a result, documents with similar common words might appear more similar than they actually are. We need a method that can identify words that are not just frequent, but also descriptive of the document's content. This is precisely what Term Frequency-Inverse Document Frequency (TF-IDF) accomplishes.
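To make the drawback concrete, here is a minimal Bag-of-Words sketch using scikit-learn's CountVectorizer on two invented plot summaries; the sentences and counts are purely illustrative.

from sklearn.feature_extraction.text import CountVectorizer

# Two toy plot summaries, invented for illustration
docs = [
    'the knight and the dragon fight in the castle',
    'the explorer and the robot travel in the galaxy',
]

# Bag-of-Words: each document becomes a vector of raw word counts
bow = CountVectorizer()
counts = bow.fit_transform(docs)

print(bow.get_feature_names_out())
print(counts.toarray())
# In both rows the highest count belongs to "the" (3 occurrences),
# even though it says nothing about what either plot is about.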
TF-IDF is a numerical statistic that reflects how important a word is to a document within a collection or corpus. It is composed of two parts: Term Frequency and Inverse Document Frequency. By multiplying these two values, we get a score that is high for words that appear often in a specific document but are rare across the entire corpus.
The TF-IDF process transforms a collection of text documents into a numerical matrix, where each document is represented by a vector of word scores.
Term Frequency measures the relative frequency of a term in a document. It's calculated by dividing the number of times a term appears in a document by the total number of terms in that document. This normalization prevents longer documents from having an unfair advantage over shorter ones.
The formula for Term Frequency is:

$$\mathrm{TF}(t, d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}$$
For example, if the word "knight" appears 3 times in a movie plot summary that contains 100 words, its TF is $\frac{3}{100} = 0.03$.
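The same arithmetic as a quick check in Python, using the numbers from the example:

# TF for "knight" in the 100-word plot summary from the example
knight_count = 3
total_terms = 100
print(knight_count / total_terms)  # 0.03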
Inverse Document Frequency measures how much information a word provides. It diminishes the weight of terms that are common across the corpus and increases the weight of terms that occur in few documents. The IDF of a term is calculated as the logarithm of the total number of documents in the corpus divided by the number of documents containing that term.
The formula for Inverse Document Frequency is:

$$\mathrm{IDF}(t) = \log\left(\frac{N}{n_t}\right)$$

where $N$ is the total number of documents in the corpus and $n_t$ is the number of documents containing the term $t$.
Words like "and" or "the" will appear in almost all documents, so their IDF will be close to $\log(1)$, which is 0. A rare word, like "interstellar," which might only appear in a few sci-fi movie descriptions, will have a much higher IDF score, marking it as more significant. The logarithm helps to dampen the effect so that extremely rare words don't completely dominate the score.
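The contrast is easy to verify with a couple of invented document counts; the corpus size and frequencies below are assumptions made purely for illustration.

import math

total_docs = 1000             # hypothetical corpus size
docs_with_the = 1000          # "the" appears in every document
docs_with_interstellar = 5    # "interstellar" appears in only a few

idf_the = math.log(total_docs / docs_with_the)                    # log(1) = 0.0
idf_interstellar = math.log(total_docs / docs_with_interstellar)  # log(200) ≈ 5.3

print(idf_the, idf_interstellar)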
The TF-IDF score for a term in a document is simply the product of its TF and IDF scores.
This final score gives a higher weight to terms that appear frequently in a document (high TF) but are rare across the entire collection of documents (high IDF). This combination effectively surfaces the words that best characterize a particular item.
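Putting the pieces together for the running example: the score is the product of the two components, and if we assume, purely for illustration, that "knight" appears in 10 of 1,000 plot summaries in the corpus, the natural logarithm gives:

$$\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$

$$\mathrm{TFIDF}(\text{knight}, d) = 0.03 \times \log\left(\frac{1000}{10}\right) \approx 0.03 \times 4.61 \approx 0.14$$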
While it's useful to understand the math, you won't need to implement TF-IDF from scratch. The scikit-learn library provides a highly optimized and flexible implementation called TfidfVectorizer. This class handles the entire workflow of tokenizing text, learning the vocabulary, calculating IDF weights, and generating the final TF-IDF matrix.
Here's how you can use it to transform a collection of movie plot descriptions into a TF-IDF matrix:
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample movie plot descriptions
documents = [
    'A brave knight battles a dragon to save a princess.',
    'A space explorer discovers a new galaxy of brave pioneers.',
    'A knight starts a search for a magical sword.'
]
# Initialize the vectorizer, removing common English stop words
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
# Fit the vectorizer to the data and transform it
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
# The resulting matrix contains a TF-IDF vector for each document
print(f"Shape of TF-IDF matrix: {tfidf_matrix.shape}")
# To see the feature names (the learned vocabulary)
print(f"Vocabulary: {tfidf_vectorizer.get_feature_names_out()}")
# The matrix is sparse; print its dense representation to inspect it
print("\nTF-IDF Matrix (dense):")
print(tfidf_matrix.toarray())
The output of tfidf_vectorizer.fit_transform is a sparse matrix. This is a memory-efficient representation because most cells in a TF-IDF matrix are zero, since any given document contains only a small subset of the total vocabulary.
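Continuing from the snippet above, a convenient way to inspect the scores for a small corpus is to label the dense array with the learned vocabulary using pandas; this assumes pandas is installed, and converting to a dense array is only practical for small matrices.

import pandas as pd

# Rows are documents, columns are vocabulary terms
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=tfidf_vectorizer.get_feature_names_out(),
)
print(tfidf_df.round(2))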
The TfidfVectorizer has several useful parameters for fine-tuning its behavior:
stop_words: Can be set to 'english' to automatically remove common English stop words.

min_df: Ignores terms that have a document frequency strictly lower than the given threshold. This is useful for removing rare words or typos.

max_df: Ignores terms that have a document frequency strictly higher than the given threshold. This can be used to filter out words that are too common to be descriptive.

ngram_range: Allows you to consider word n-grams instead of just single words. For example, ngram_range=(1, 2) includes both single words (unigrams) and pairs of adjacent words (bigrams), which can capture more context.

A short configuration sketch combining these options appears at the end of this section.

By applying TF-IDF, we have successfully converted our unstructured text into meaningful numerical vectors. Each row in our tfidf_matrix represents an item, ready for the next step: measuring similarity.
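As a closing reference, here is one way these options might be combined. The specific values are illustrative assumptions rather than recommendations; sensible thresholds depend on the size and vocabulary of your corpus.

from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative configuration; tune these thresholds for your own corpus
tuned_vectorizer = TfidfVectorizer(
    stop_words='english',  # drop common English stop words
    min_df=2,              # ignore terms that appear in fewer than 2 documents
    max_df=0.8,            # ignore terms that appear in more than 80% of documents
    ngram_range=(1, 2),    # keep single words and two-word phrases
)

Raising min_df trims one-off terms and typos, while lowering max_df filters corpus-specific filler words that a generic stop word list misses.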