Now that we've discussed the theoretical underpinnings of representing text numerically, let's put these concepts into practice. This section focuses on using standard Python libraries, particularly scikit-learn, to generate TF-IDF features and incorporate N-grams. We assume you have a working Python environment with scikit-learn installed.
First, let's import the necessary tool: TfidfVectorizer from scikit-learn. We'll also define a small sample corpus to work with.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
# Sample documents
corpus = [
"The quick brown fox jumps over the lazy dog.",
"The lazy dog slept in the sun.",
"The quick brown cat ran away.",
"Never jump over the lazy dog quickly!" # Added variation
]
# Optional: Display the corpus
print("Sample Corpus:")
for i, doc in enumerate(corpus):
print(f"Document {i+1}: {doc}")
This simple corpus contains four short documents. Our goal is to convert these text strings into a matrix where rows represent documents and columns represent features (words weighted by TF-IDF).
The TfidfVectorizer combines tokenization, counting, and TF-IDF transformation into a single object. Let's initialize it and fit it to our corpus.
# 1. Initialize the TfidfVectorizer
# Default settings: lowercase=True, token_pattern=r"(?u)\b\w\w+\b", stop_words=None, ngram_range=(1, 1)
tfidf_vectorizer = TfidfVectorizer()
# 2. Fit the vectorizer to the corpus and transform the data
# fit_transform() learns the vocabulary and IDF, then returns the TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
# The output is a sparse matrix (efficient for high dimensions)
print("\nShape of TF-IDF matrix:", tfidf_matrix.shape)
# Output: Shape of TF-IDF matrix: (4, 17)
# This means 4 documents and 17 unique terms (features) in the vocabulary after default processing.
The fit_transform method first learns the vocabulary (all unique tokens meeting the criteria) and calculates the Inverse Document Frequency (IDF) for each term across the entire corpus. Then, it transforms the corpus into a document-term matrix where each cell contains the TF-IDF score for a specific term in a specific document.
The resulting tfidf_matrix is typically a SciPy sparse matrix. This is memory-efficient because most entries in a document-term matrix are zero (most words don't appear in most documents).
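Because fit_transform() has already learned the vocabulary and IDF weights, the fitted vectorizer can be reused to encode unseen text with transform() alone. Here is a minimal sketch; the new_docs strings are made-up examples, not part of the corpus above.

# Reuse the fitted vectorizer on new, unseen documents (hypothetical examples)
new_docs = ["A quick dog jumps over the cat.", "The sun is bright."]
new_matrix = tfidf_vectorizer.transform(new_docs)

print("Shape of transformed new documents:", new_matrix.shape)  # 2 rows, same columns as the original matrix
print("Stored non-zero entries:", new_matrix.nnz)               # sparse format keeps only the non-zero scores
# Words not seen during fitting (e.g. 'bright') are simply ignored.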
To understand the matrix better, let's look at the learned vocabulary and the dense representation of the matrix (for small examples).
# Get the vocabulary (mapping from term to column index)
feature_names = tfidf_vectorizer.get_feature_names_out()
print("\nVocabulary (Feature Names):")
print(feature_names)
# Output: ['away' 'brown' 'cat' 'dog' 'fox' 'in' 'jump' 'jumps' 'lazy' 'never'
#  'over' 'quick' 'quickly' 'ran' 'slept' 'sun' 'the']
# Note: the default token pattern keeps only tokens with two or more word characters
# and lowercases everything; no stop words are removed unless stop_words is set explicitly.
# Convert the sparse matrix to a dense NumPy array for inspection
dense_tfidf_matrix = tfidf_matrix.toarray()
# Display as a DataFrame for readability
df_tfidf = pd.DataFrame(dense_tfidf_matrix, columns=feature_names, index=[f"Doc_{i+1}" for i in range(len(corpus))])
print("\nTF-IDF Matrix (Dense):")
print(df_tfidf.round(2)) # Round for display
You'll notice that terms common across many documents (like "the") tend to have lower IDF values, potentially resulting in lower TF-IDF scores compared to rarer, more discriminative terms (like "fox" or "cat"), even if their term frequency (TF) is similar within a document. For instance, "lazy" and "dog" appear in three documents, while "fox" appears only in one. Their TF-IDF scores will reflect this difference in document frequency.
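You can check these weights directly via the vectorizer's idf_ attribute. With scikit-learn's default smooth_idf=True setting, the weight is idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. A quick inspection for a few terms:

# Inspect the learned IDF weights (default smooth_idf=True)
idf_values = dict(zip(feature_names, tfidf_vectorizer.idf_))
for term in ['the', 'lazy', 'dog', 'fox']:
    print(f"IDF({term}) = {idf_values[term]:.3f}")
# 'the' appears in all 4 documents   -> ln(5/5) + 1 = 1.000 (lowest weight)
# 'lazy' and 'dog' appear in 3 docs  -> ln(5/4) + 1 ≈ 1.223
# 'fox' appears in only 1 document   -> ln(5/2) + 1 ≈ 1.916 (highest weight)

If you want very common words removed entirely rather than merely down-weighted, TfidfVectorizer also accepts a stop_words='english' argument.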
Let's visualize the TF-IDF scores for a few selected words across the documents.
# Select a few terms to visualize
terms_to_plot = ['dog', 'fox', 'lazy', 'quick', 'the']
term_indices = [list(feature_names).index(term) for term in terms_to_plot if term in feature_names]
term_labels = [feature_names[i] for i in term_indices]
# Extract scores for these terms
scores_to_plot = dense_tfidf_matrix[:, term_indices]
# Create data for Plotly bar chart
plot_data = []
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd'] # Default Plotly colors
for i, term in enumerate(term_labels):
    plot_data.append({
        "x": [f"Doc_{j+1}" for j in range(len(corpus))],
        "y": scores_to_plot[:, i],
        "name": term,
        "type": "bar",
        "marker": {"color": colors[i % len(colors)]}
    })
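To render the chart, the list of trace dictionaries above can be passed straight to a Plotly figure. This assumes the plotly package is installed; the layout settings are just illustrative.

import plotly.graph_objects as go

# Build a grouped bar chart from the trace dictionaries defined above
fig = go.Figure(data=plot_data)
fig.update_layout(barmode="group", title="TF-IDF scores for selected terms",
                  xaxis_title="Document", yaxis_title="TF-IDF score")
fig.show()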
TF-IDF scores for the terms 'dog', 'fox', 'lazy', 'quick', and 'the' across the four sample documents. Note how 'the' has relatively lower scores despite appearing frequently, while 'fox' has a high score only in Doc 1 where it appears. 'Dog' and 'lazy' have similar patterns as they often co-occur.
Standard TF-IDF (using unigrams, or single words) loses word order information. "quick brown fox" and "fox brown quick" would have identical representations. N-grams help capture local context by treating sequences of words as features.
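You can verify this directly: two permutations of the same words produce identical unigram TF-IDF vectors. A small sketch, using a fresh vectorizer fitted on just the two strings:

import numpy as np

# Two word-order permutations of the same tokens
pair = ["quick brown fox", "fox brown quick"]
unigram_matrix = TfidfVectorizer().fit_transform(pair).toarray()
print(np.allclose(unigram_matrix[0], unigram_matrix[1]))  # True: unigram features ignore word order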
We can configure TfidfVectorizer to include N-grams using the ngram_range parameter. It takes a tuple (min_n, max_n). For example, ngram_range=(1, 2) generates unigrams and bigrams.
# Initialize vectorizer to include unigrams and bigrams
ngram_vectorizer = TfidfVectorizer(ngram_range=(1, 2)) # Generate 1-grams and 2-grams
# Fit and transform
ngram_tfidf_matrix = ngram_vectorizer.fit_transform(corpus)
# Check the new shape and some feature names
print("\nShape of TF-IDF matrix with N-grams (1, 2):", ngram_tfidf_matrix.shape)
# Output: Shape of TF-IDF matrix with N-grams (1, 2): (4, 35) -> More features now!
ngram_feature_names = ngram_vectorizer.get_feature_names_out()
print("\nSample N-gram Features:")
# Print first 10 and last 10 features for illustration
print(list(ngram_feature_names[:10]) + list(ngram_feature_names[-10:]))
# Output includes features like 'brown fox', 'lazy dog', 'quick brown', 'the lazy', etc.
As you can see, the number of features increases significantly when adding N-grams. This captures more context (e.g., distinguishing "lazy dog" from just "lazy" and "dog" appearing separately) but also increases the dimensionality of the feature space. This can make computation more expensive and sometimes requires more data to avoid overfitting.
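To get a feel for how quickly the feature space grows, you can compare vocabulary sizes for a few ngram_range settings on the sample corpus (the exact counts depend on the corpus, of course):

# Compare feature-space size for different n-gram ranges on the sample corpus
for n_range in [(1, 1), (1, 2), (1, 3)]:
    n_features = TfidfVectorizer(ngram_range=n_range).fit_transform(corpus).shape[1]
    print(f"ngram_range={n_range}: {n_features} features")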
Common practice involves using unigrams and bigrams (ngram_range=(1, 2)), and occasionally trigrams (ngram_range=(1, 3)), depending on the task and corpus size. You can limit the total number of features using the max_features parameter in TfidfVectorizer, which keeps only the top features ordered by term frequency across the corpus.
# Example: Limiting features while using bigrams
limited_ngram_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=20)
limited_ngram_matrix = limited_ngram_vectorizer.fit_transform(corpus)
print("\nShape with N-grams (1, 2) and max_features=20:", limited_ngram_matrix.shape)
# Output: Shape with N-grams (1, 2) and max_features=20: (4, 20)
limited_features = limited_ngram_vectorizer.get_feature_names_out()
print("\nLimited Feature Set Sample:")
print(limited_features[:10]) # Show some of the selected top features
In this hands-on section, you learned how to:
- Use scikit-learn's TfidfVectorizer to convert raw text documents into a numerical TF-IDF matrix.
- Incorporate N-grams (such as bigrams) into the feature set using the ngram_range parameter.
- Limit the size of the feature space with max_features.

Generating meaningful numerical features like TF-IDF vectors, potentially enhanced with N-grams, is a fundamental step in preparing text data for machine learning models, which we will explore in the next chapter.