Alright, let's put the theory into practice. In this hands-on exercise, you'll build a complete text classification pipeline using the concepts covered in this chapter and the previous ones. We'll take a dataset, preprocess it, extract features, train a classifier, and evaluate its performance.
Our goal is to build a simple spam detector. We'll use a small dataset containing text messages labeled as either "spam" or "ham" (not spam).
First, ensure you have the necessary libraries installed, particularly scikit-learn and pandas. The confusion matrix plot later in this exercise also uses plotly. If not, you can typically install them using pip:
pip install scikit-learn pandas plotly
Let's assume our dataset is in a simple CSV file named spam_data.csv with two columns: label ('ham' or 'spam') and text.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression # Example of another classifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import plotly.graph_objects as go
import numpy as np
# Load the dataset (replace 'spam_data.csv' with your actual file path if different)
# For demonstration, let's create a small sample DataFrame
data = {
    'label': ['ham', 'spam', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham', 'spam', 'ham'],
    'text': [
        'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
        'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C apply 08452810075over18s',
        'U dun say so early hor... U c already then say...',
        'WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.',
        'Nah I dont think he goes to usf, he lives around here though',
        'Even my brother is not like to speak with me. They treat me like aids patent.',
        'URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18',
        'I am gonna be home soon and i dont want to talk about this stuff anymore tonight, k? Ive cried enough today.',
        'SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info',
        'I have been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise.',
    ],
}
df = pd.DataFrame(data)
# Display the first few rows and check class distribution
print("Dataset Head:")
print(df.head())
print("\nClass Distribution:")
print(df['label'].value_counts())
# Separate features (text) and target (label)
X = df['text']
y = df['label']
# Split data into training and testing sets
# Using a small test_size for this example dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print(f"\nTraining set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
Here, we build a small sample DataFrame with pandas (standing in for the CSV file), inspect it, and then split it into training and testing sets using train_test_split. With 10 messages and test_size=0.3, this gives 7 training messages and 3 test messages. Using stratify=y is good practice, especially for potentially imbalanced datasets, as it ensures the proportion of labels is roughly the same in both the train and test sets.
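To see what stratify does, here is a minimal sketch on made-up labels (80 'ham' and 20 'spam', not the chapter's dataset). The variable names are illustrative only. It compares the class counts in the test set with and without stratification.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 'ham' and 20 'spam' placeholder messages
labels = ['ham'] * 80 + ['spam'] * 20
texts = [f"message {i}" for i in range(len(labels))]

# Stratified split: the test set keeps the 80/20 class ratio
_, _, y_train_s, y_test_s = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Unstratified split: the test-set ratio depends on how the shuffle falls
_, _, y_train_u, y_test_u = train_test_split(
    texts, labels, test_size=0.25, random_state=0
)

print("Stratified test counts:  ", Counter(y_test_s))
print("Unstratified test counts:", Counter(y_test_u))
```

With stratification, the 25 test messages contain 20 'ham' and 5 'spam', matching the overall proportions. Without it, the counts can drift, which matters most when one class is rare.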
As discussed, directly applying preprocessing and feature extraction steps before training can lead to data leakage if not done carefully (e.g., fitting TfidfVectorizer on the whole dataset before splitting). scikit-learn's Pipeline object is excellent for chaining these steps together, ensuring that transformations are learned only from the training data.
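The following sketch illustrates the point on a tiny, made-up example (the texts and variable names are assumptions for illustration). After fitting the pipeline on the training texts only, the vectorizer's vocabulary contains no word that appears solely in the test message.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data; the word 'lottery' appears only in the test message
train_texts = ["win a free prize now", "are we still meeting for lunch"]
train_labels = ["spam", "ham"]
test_texts = ["free lottery prize"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
pipe.fit(train_texts, train_labels)

# The fitted vocabulary is built from the training texts only
vocab = pipe.named_steps["tfidf"].vocabulary_
print("'prize' in vocabulary:  ", "prize" in vocab)
print("'lottery' in vocabulary:", "lottery" in vocab)

# At prediction time, unseen words such as 'lottery' are simply ignored
prediction = pipe.predict(test_texts)
print(prediction)
```

Had the vectorizer been fitted on all texts before splitting, 'lottery' would have entered the vocabulary and influenced the IDF weights, which is exactly the leakage the pipeline avoids.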
We'll create a pipeline that first applies TF-IDF vectorization and then trains a Multinomial Naive Bayes (MNB) classifier. MNB is often a good baseline for text classification tasks.
# Create a pipeline with TF-IDF Vectorizer and Multinomial Naive Bayes
text_clf_nb = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),  # Add stop word removal
    ('clf', MultinomialNB()),
])
# Train the entire pipeline on the training data
print("\nTraining Naive Bayes pipeline...")
text_clf_nb.fit(X_train, y_train)
print("Training complete.")
In this pipeline:
- TfidfVectorizer(stop_words='english'): Converts text documents into a matrix of TF-IDF features. We also include basic English stop word removal directly within the vectorizer. It will learn the vocabulary and IDF weights only from X_train when fit is called.
- MultinomialNB(): The classifier that will be trained on the TF-IDF features.
When text_clf_nb.fit(X_train, y_train) is executed, the training data X_train flows through the pipeline: first, the tfidf step is fitted on it and transforms it, and then the resulting features are used to train the clf step (Naive Bayes).
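The flow described above can be written out by hand. The sketch below uses made-up texts (an assumption for illustration) and shows that calling the vectorizer and classifier step by step gives the same predictions as the pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data
train_texts = [
    "win cash now",
    "free prize claim now",
    "see you at lunch",
    "call me when you are home",
]
train_labels = ["spam", "spam", "ham", "ham"]
new_texts = ["claim your free cash", "lunch at home"]

# Pipeline version: one fit call runs both steps
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(new_texts)

# Manual version of the same flow
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(train_texts)  # learn vocabulary/IDF, then transform
clf = MultinomialNB().fit(X_train_tfidf, train_labels)  # train on the TF-IDF features
X_new_tfidf = vectorizer.transform(new_texts)           # transform only, no refitting
manual_pred = clf.predict(X_new_tfidf)

print("Pipeline:", pipe_pred)
print("Manual:  ", manual_pred)
```

The pipeline mainly removes the bookkeeping: there is no separate transform call to forget, and no chance of calling fit_transform on new data by mistake.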
Now, let's use the trained pipeline to make predictions on our held-out test set (X_test) and evaluate the performance using the metrics discussed earlier.
# Make predictions on the test set
print("\nMaking predictions on the test set...")
y_pred_nb = text_clf_nb.predict(X_test)
# Evaluate the Naive Bayes model
print("\nNaive Bayes Model Evaluation:")
accuracy_nb = accuracy_score(y_test, y_pred_nb)
print(f"Accuracy: {accuracy_nb:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_nb))
print("\nConfusion Matrix:")
cm_nb = confusion_matrix(y_test, y_pred_nb, labels=text_clf_nb.classes_)
print(cm_nb)
# Function to create a Plotly confusion matrix figure
def plot_confusion_matrix(cm, labels, title='Confusion Matrix'):
    # Use a blue color scale
    colorscale = [
        [0.0, '#e9ecef'],  # light gray for 0
        [0.5, '#74c0fc'],  # light blue
        [1.0, '#1c7ed6']   # dark blue for max
    ]
    fig = go.Figure(data=go.Heatmap(
        z=cm,
        x=labels,
        y=labels,
        hoverongaps=False,
        colorscale=colorscale,
        showscale=False  # Hide color bar for simplicity
    ))
    # Add annotations for cell values
    annotations = []
    for i, row in enumerate(cm):
        for j, value in enumerate(row):
            annotations.append(
                go.layout.Annotation(
                    text=str(value),
                    x=labels[j],
                    y=labels[i],
                    xref="x",
                    yref="y",
                    showarrow=False,
                    # Black text on light cells, white text on dark cells
                    font=dict(color="black" if value < (cm.max() / 2) else "white")
                )
            )
    fig.update_layout(
        title=title,
        xaxis_title="Predicted Label",
        yaxis_title="True Label",
        xaxis=dict(side='bottom'),  # Place x-axis labels at the bottom
        yaxis=dict(autorange='reversed'),  # Display true labels from top to bottom
        width=450, height=400,  # Adjust size as needed
        margin=dict(l=50, r=50, t=50, b=50),
        annotations=annotations
    )
    return fig
# Generate and display the confusion matrix plot
fig_nb = plot_confusion_matrix(cm_nb, text_clf_nb.classes_, title='Confusion Matrix (Naive Bayes)')
# In a real web environment or notebook, you would display fig_nb here.
# For this format, we output the JSON representation.
print("\nPlotly Confusion Matrix JSON:")
print(fig_nb.to_json(pretty=False))
Confusion matrix showing predicted vs. true labels for the spam classification task using the Naive Bayes model. Based on this very small sample, the model correctly identified 'ham' but misclassified 'spam'.
Interpreting the Results (Example based on hypothetical output):
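One way to read a two-class confusion matrix is to unpack its four cells and derive precision and recall from them. The labels and predictions below are made up for illustration and are not the output of the pipeline above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions, for illustration only
y_true = ["ham", "ham", "ham", "spam", "spam", "ham", "spam", "ham"]
y_pred = ["ham", "ham", "spam", "spam", "ham", "ham", "spam", "ham"]

# Rows are true labels, columns are predicted labels, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])

# Treating 'spam' as the positive class:
# TN = ham kept as ham, FP = ham flagged as spam,
# FN = spam missed as ham, TP = spam caught as spam
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of messages flagged as spam, the share that really was spam
recall = tp / (tp + fn)     # of actual spam messages, the share that was caught

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Spam precision: {precision:.2f}, spam recall: {recall:.2f}")
```

For a spam filter, a false positive (a legitimate message sent to the spam folder) is usually more costly than a false negative, so spam precision often deserves closer attention than overall accuracy.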
The pipeline makes it easy to swap out components. Let's try Logistic Regression instead of Naive Bayes.
# Create and train a pipeline with Logistic Regression
text_clf_lr = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(solver='liblinear', random_state=42)),  # Using liblinear solver suitable for smaller datasets
])
print("\nTraining Logistic Regression pipeline...")
text_clf_lr.fit(X_train, y_train)
print("Training complete.")
# Make predictions and evaluate
print("\nMaking predictions with Logistic Regression...")
y_pred_lr = text_clf_lr.predict(X_test)
print("\nLogistic Regression Model Evaluation:")
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print(f"Accuracy: {accuracy_lr:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr))
print("\nConfusion Matrix:")
cm_lr = confusion_matrix(y_test, y_pred_lr, labels=text_clf_lr.classes_)
print(cm_lr)
# You could generate another Plotly confusion matrix for Logistic Regression here
# fig_lr = plot_confusion_matrix(cm_lr, text_clf_lr.classes_)
# print(fig_lr.to_json(pretty=False))
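By comparing the evaluation metrics (accuracy, precision, recall, F1-score, confusion matrix) from both models, you can determine which classifier performed better on this specific task and dataset. Keep in mind that with only 3 test messages in this sample, a difference between the two models says very little; a meaningful comparison needs a much larger test set.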
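When comparing several classifiers, it can help to collect the metrics in one table. The sketch below is one possible way to do this, using made-up texts and a hypothetical candidates dictionary. It reuses the same TF-IDF step for each classifier and reports accuracy and the F1-score for the 'spam' class.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data
train_texts = [
    "win a free prize now",
    "claim your free cash prize",
    "urgent free entry to win cash",
    "see you at lunch tomorrow",
    "call me when you get home",
    "thanks for the help today",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
test_texts = ["free cash prize waiting, claim now", "are you home for lunch"]
test_labels = ["spam", "ham"]

# Classifiers to compare (illustrative choice)
candidates = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(solver="liblinear", random_state=42),
}

rows = []
for name, clf in candidates.items():
    # Same vectorizer settings for every classifier, so only the model differs
    pipe = Pipeline([("tfidf", TfidfVectorizer(stop_words="english")), ("clf", clf)])
    pipe.fit(train_texts, train_labels)
    pred = pipe.predict(test_texts)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(test_labels, pred),
        "spam_f1": f1_score(test_labels, pred, pos_label="spam", zero_division=0),
    })

summary = pd.DataFrame(rows).set_index("model")
print(summary)
```

Keeping the vectorizer identical across candidates means any difference in the table comes from the classifier, not from the features.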
This exercise demonstrated the fundamental workflow of building and evaluating a text classifier. To improve upon this baseline, you could:
- Tune TfidfVectorizer parameters (e.g., ngram_range=(1, 2) to include bigrams, max_df, min_df).
- Try other classifiers (LinearSVC is often effective for text).
- Use GridSearchCV or RandomizedSearchCV to find the optimal settings for the vectorizer and classifier (covered earlier in the chapter).
- Use cross_val_score for a more robust estimate of model performance than a single train-test split.
This practical application solidifies the process of turning text into classifications, a common and valuable task in Natural Language Processing.
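The improvement ideas above can be sketched together in a few lines. This is a minimal illustration on a made-up 12-message corpus, and the parameter grid is an arbitrary example rather than a recommended setting. It swaps in LinearSVC, scores the pipeline with cross_val_score, and then searches over vectorizer and classifier parameters with GridSearchCV.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 6 spam messages followed by 6 ham messages
texts = [
    "win a free prize now",
    "free cash prize claim now",
    "claim your free entry to win cash",
    "urgent you have won a free prize",
    "txt now to claim your cash prize",
    "win cash now free entry",
    "are you coming home for lunch",
    "see you at home tonight",
    "call me when you get home",
    "lunch tomorrow at my place",
    "i will call you tonight",
    "thanks for lunch see you tomorrow",
]
labels = ["spam"] * 6 + ["ham"] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

# Cross-validation: 3 stratified folds instead of a single train-test split
scores = cross_val_score(pipe, texts, labels, cv=3)
print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())

# Grid search: pipeline parameters are addressed as <step name>__<parameter>
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    "tfidf__min_df": [1, 2],                 # drop terms seen in fewer documents
    "clf__C": [0.1, 1.0],                    # regularization strength of the SVM
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

Because the whole pipeline is passed to GridSearchCV, the vectorizer is refitted inside every fold, so the search itself stays free of the data leakage discussed earlier.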
© 2025 ApX Machine Learning