While structured attributes like genres or actors are straightforward to use, much of the rich information about an item, such as a movie's plot summary or a product's description, is unstructured text. To build a content-based filter, we must first convert this text into a numerical format, a process called vectorization. A machine learning model cannot directly understand words like "adventure" or "galaxy," but it can operate on vectors of numbers.

A simple approach is to count the occurrences of each word, often called the Bag-of-Words (BoW) model. In this model, each document becomes a vector where each element corresponds to a word count. However, this method has a significant drawback: very common words like "the," "a," and "in" dominate the vectors despite carrying little descriptive meaning. As a result, documents that share many of these common words can appear more similar than they actually are. We need a method that identifies words that are not just frequent, but also descriptive of a document's content. This is precisely what Term Frequency-Inverse Document Frequency (TF-IDF) accomplishes.

## The Components of TF-IDF

TF-IDF is a numerical statistic that reflects how important a word is to a document within a collection, or corpus. It is composed of two parts: Term Frequency and Inverse Document Frequency. Multiplying these two values yields a score that is high for words that appear often in a specific document but are rare across the entire corpus.

```dot
digraph G {
    graph [fontname="Arial"];
    node [shape=box, style="rounded,filled", fontname="Arial", fontsize=10, fillcolor="#e9ecef"];
    edge [fontname="Arial", fontsize=9];
    rankdir=TB;

    subgraph cluster_0 {
        label = "Corpus";
        bgcolor="#f8f9fa";
        doc1 [label="Document 1\n(Movie Plot)"];
        doc2 [label="Document 2\n(Movie Plot)"];
        doc3 [label="..."];
    }

    subgraph cluster_1 {
        label = "Vectorization Process";
        bgcolor="#f8f9fa";
        tfidf [label="TF-IDF Calculation", shape=ellipse, style=filled, fillcolor="#a5d8ff"];
        tf [label="Term Frequency (TF)\n'How often does a word appear?'", fillcolor="#b2f2bb"];
        idf [label="Inverse Document Frequency (IDF)\n'How rare is the word?'", fillcolor="#ffec99"];
    }

    subgraph cluster_2 {
        label = "Output";
        bgcolor="#f8f9fa";
        matrix [label="TF-IDF Matrix\n(Items as Vectors)", shape=cylinder, style=filled, fillcolor="#bac8ff"];
    }

    {doc1, doc2, doc3} -> tfidf;
    tfidf -> tf [dir=none, style=dashed];
    tfidf -> idf [dir=none, style=dashed];
    tfidf -> matrix;
}
```

*The TF-IDF process transforms a collection of text documents into a numerical matrix, where each document is represented by a vector of word scores.*

### Term Frequency (TF)

Term Frequency measures the relative frequency of a term in a document. It is calculated by dividing the number of times a term appears in a document by the total number of terms in that document. This normalization prevents longer documents from having an unfair advantage over shorter ones.

The formula for Term Frequency is:

$$ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$

For example, if the word "knight" appears 3 times in a movie plot summary that contains 100 words, its TF is $3/100 = 0.03$.
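To make the calculation concrete, here is a minimal sketch of the TF formula applied to a toy plot summary. The `term_frequency` helper and the example text are illustrative assumptions, not part of any library; real tokenization (lowercasing, punctuation handling) is more involved.

```python
# A toy plot summary, already lowercased and stripped of punctuation for simplicity.
plot = "a brave knight battles a dragon and the knight saves a princess"
tokens = plot.split()  # naive whitespace tokenization

def term_frequency(term, tokens):
    """Occurrences of `term` divided by the total number of terms in the document."""
    return tokens.count(term) / len(tokens)

# "knight" appears 2 times among 12 tokens, so its TF is 2/12 ≈ 0.167.
print(term_frequency("knight", tokens))
```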
### Inverse Document Frequency (IDF)

Inverse Document Frequency measures how much information a word provides. It diminishes the weight of terms that are common across the corpus and increases the weight of terms that are rare. The IDF of a term is calculated as the logarithm of the total number of documents in the corpus divided by the number of documents containing that term.

The formula for Inverse Document Frequency is:

$$ \text{IDF}(t, D) = \log\left(\frac{\text{Total number of documents } D}{\text{Number of documents containing term } t}\right) $$

Words like "and" or "the" appear in almost every document, so the ratio inside the logarithm is close to 1 and their IDF is close to $\log(1) = 0$. A rare word like "interstellar," which might only appear in a few sci-fi movie descriptions, has a much higher IDF score, marking it as more significant. The logarithm dampens the effect so that extremely rare words do not completely dominate the score.

### The Final TF-IDF Score

The TF-IDF score for a term in a document is simply the product of its TF and IDF scores:

$$ \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) $$

This final score gives a higher weight to terms that appear frequently in a document (high TF) but are rare across the entire collection of documents (high IDF). This combination effectively surfaces the words that best characterize a particular item.

## Implementation with Scikit-learn

While it is useful to understand the math, you won't need to implement TF-IDF from scratch. The scikit-learn library provides a highly optimized and flexible implementation called `TfidfVectorizer`. This class handles the entire workflow of tokenizing text, learning the vocabulary, calculating IDF weights, and generating the final TF-IDF matrix.

Here's how you can use it to transform a collection of movie plot descriptions into a TF-IDF matrix:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample movie plot descriptions
documents = [
    'A brave knight battles a dragon to save a princess.',
    'A space explorer discovers a new galaxy of brave pioneers.',
    'A knight starts a search for a magical sword.',
]

# Initialize the vectorizer, removing common English stop words
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# Fit the vectorizer to the data and transform it
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# The resulting matrix contains a TF-IDF vector for each document
print(f"Shape of TF-IDF matrix: {tfidf_matrix.shape}")

# To see the feature names (the learned vocabulary)
print(f"Vocabulary: {tfidf_vectorizer.get_feature_names_out()}")

# The matrix is sparse; print its dense representation to inspect it
print("\nTF-IDF Matrix (dense):")
print(tfidf_matrix.toarray())
```

The output of `tfidf_vectorizer.fit_transform` is a sparse matrix. This is a memory-efficient representation because most cells in a TF-IDF matrix are zero, since any given document contains only a small subset of the total vocabulary.

The `TfidfVectorizer` has several useful parameters for fine-tuning its behavior (a configuration sketch follows the list):

- `stop_words`: Can be set to `'english'` to automatically remove common English stop words.
- `min_df`: Ignores terms that have a document frequency strictly lower than the given threshold. This is useful for removing rare words or typos.
- `max_df`: Ignores terms that have a document frequency strictly higher than the given threshold. This can be used to filter out words that are too common to be descriptive.
- `ngram_range`: Allows you to consider word n-grams instead of just single words. For example, `ngram_range=(1, 2)` includes both single words (unigrams) and pairs of adjacent words (bigrams), which can capture more context.
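To see how these parameters combine, here is a minimal configuration sketch that reuses the `documents` list from above. The specific `min_df`, `max_df`, and `ngram_range` values are illustrative choices for this tiny three-document corpus, not general recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tuned_vectorizer = TfidfVectorizer(
    stop_words='english',  # drop common English stop words
    min_df=2,              # keep only terms that appear in at least 2 documents
    max_df=0.9,            # drop terms that appear in more than 90% of documents
    ngram_range=(1, 2),    # consider unigrams and bigrams
)

# With only three toy plots, min_df=2 prunes the vocabulary down to the terms
# shared by at least two documents ("brave" and "knight").
tuned_matrix = tuned_vectorizer.fit_transform(documents)
print(tuned_vectorizer.get_feature_names_out())
print(tuned_matrix.shape)
```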
By applying TF-IDF, we have successfully converted our unstructured text into meaningful numerical vectors. Each row in our `tfidf_matrix` represents an item, ready for the next step: measuring similarity.
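As a brief preview of that next step, the sketch below compares rows of the TF-IDF matrix using cosine similarity, one common choice for this task (the specific metric is an assumption here; this section has not yet specified which similarity measure is used).

```python
from sklearn.metrics.pairwise import cosine_similarity

# Compare every pair of item vectors in the TF-IDF matrix.
# similarity_matrix[i, j] is the similarity between item i and item j.
similarity_matrix = cosine_similarity(tfidf_matrix)

# Documents that share distinctive terms (e.g., "knight" in documents 0 and 2)
# get a nonzero score, while documents with no overlapping terms score 0.
print(similarity_matrix.round(2))
```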