By W. M. Thor on Oct 1, 2024
Sentiment analysis is a popular technique in natural language processing (NLP) used to determine the sentiment or emotion behind a piece of text. Whether it's to understand customer feedback or to analyze social media trends, sentiment analysis offers a way to automate the interpretation of human emotions at scale.
In this guide, we’ll perform sentiment analysis on tweets using the Naive Bayes classifier. This post will take you step-by-step through the process of data preparation, feature extraction, model training, and evaluation.
Sentiment analysis is a text classification problem where the goal is to classify the sentiment of a given text as positive, negative, or neutral. Tweets are ideal for sentiment analysis because they are short, opinionated, and abundant, making them a valuable source for gauging public opinion.
The Naive Bayes classifier is one of the simplest algorithms for text classification and often serves as a strong baseline. It's based on Bayes' theorem and assumes that, given the class, each feature (here, each word) occurs independently of every other feature, hence the term "naive."
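To make the independence assumption concrete, here is a minimal sketch of how a multinomial Naive Bayes classifier could score a tweet by hand. The toy training set, the `train` and `predict` helpers, and the add-one smoothing value are illustrative assumptions, not part of the dataset or code used later in this guide.

```python
import math
from collections import Counter, defaultdict

# Toy training data (hypothetical): (tokens, label)
TRAIN = [
    (["love", "this", "phone"], "positive"),
    (["great", "service", "love"], "positive"),
    (["hate", "this", "phone"], "negative"),
    (["terrible", "service"], "negative"),
]

def train(examples, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-likelihoods."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for tokens, label in examples:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in examples for w in tokens}
    log_prior = {c: math.log(n / len(examples)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # Add-alpha smoothing keeps unseen words from getting zero probability.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like, vocab

def predict(tokens, log_prior, log_like, vocab):
    """Score each class as log P(class) + sum of log P(word | class)."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are skipped.
        scores[c] = log_prior[c] + sum(log_like[c][w] for w in tokens if w in vocab)
    return max(scores, key=scores.get), scores

log_prior, log_like, vocab = train(TRAIN)
label, scores = predict(["love", "service"], log_prior, log_like, vocab)
print(label)  # positive
```

Because the per-word log-probabilities are simply added, each word contributes to the score on its own. That is the "naive" independence assumption in practice.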
For this tutorial, we’ll use a dataset of tweets labeled as positive, negative, or neutral. You can find Twitter sentiment datasets on platforms like Kaggle.
Before we get started, you'll need a few libraries. If you don't already have them, install them using the following command:
pip install pandas numpy scikit-learn nltk
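Here's a breakdown of the libraries we'll use: pandas for loading and manipulating the dataset, NumPy for numerical operations, scikit-learn for feature extraction, model training, and evaluation, and NLTK for the English stopword list.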
Before training the model, we need to clean and preprocess the tweet data. Tweets often contain noise like URLs, mentions, hashtags, and emojis, which can impact the quality of the model.
# Import libraries
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Download stopwords
nltk.download('stopwords')
# Load the dataset
df = pd.read_csv('tweets_sentiment.csv') # Use your own dataset here
# Display the first few rows of the dataset
print(df.head())
# Data cleaning function
def clean_tweet(text):
    text = re.sub(r'@\w+', '', text) # Remove mentions (handles may contain underscores)
    text = re.sub(r'https?://\S+', '', text) # Remove URLs, including query strings
    text = re.sub(r'[^a-zA-Z]', ' ', text) # Replace non-alphabetic characters with spaces
    text = text.lower() # Convert to lowercase
    return text
# Apply the cleaning function to the dataset
df['cleaned_tweet'] = df['tweet'].apply(clean_tweet)
# Remove stopwords (common words like 'and', 'the', etc.)
stop_words = set(stopwords.words('english'))
df['cleaned_tweet'] = df['cleaned_tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
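As a quick sanity check, it can help to run the cleaning steps on a single made-up tweet before applying them to the whole dataset. The sketch below is self-contained: the `clean_and_filter` helper, the sample tweet, and the small stopword set are assumptions for illustration (the tutorial itself uses NLTK's full English list). It uses `\w+` and `\S+` patterns so that underscores in handles and query strings in URLs are matched.

```python
import re

# Small illustrative stopword set (assumption); NLTK's English list is much longer.
STOP_WORDS = {"the", "is", "a", "and", "to", "at", "so", "this", "i"}

def clean_and_filter(text):
    """Strip mentions, URLs, and non-letters, then drop stopwords."""
    text = re.sub(r'@\w+', '', text)          # Remove mentions
    text = re.sub(r'https?://\S+', '', text)  # Remove URLs
    text = re.sub(r'[^a-zA-Z]', ' ', text)    # Replace non-letters with spaces
    tokens = text.lower().split()
    return ' '.join(t for t in tokens if t not in STOP_WORDS)

sample = "@acme_support I LOVE the new phone!!! https://example.com/x1 #happy"
print(clean_and_filter(sample))  # love new phone happy
```

Note that the hashtag's word survives (only the `#` is dropped), while punctuation, digits, and emojis are discarded.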
To feed the tweets into our Naive Bayes classifier, we need to convert the text into numerical features. We'll use Bag of Words (BoW) representation, where each tweet is represented as a vector of word counts.
# Initialize the CountVectorizer
vectorizer = CountVectorizer(max_features=1500) # Limit the number of features to 1500
# Convert the cleaned tweets to a matrix of token counts
X = vectorizer.fit_transform(df['cleaned_tweet']).toarray()
# Labels (sentiment) as target variable
y = df['sentiment'] # Assuming 'sentiment' column contains the labels (positive, negative, neutral)
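To see what a Bag of Words matrix looks like, here is a minimal hand-rolled sketch on three made-up cleaned tweets. It uses only the standard library rather than `CountVectorizer`, and the `docs` list is an assumption for illustration.

```python
from collections import Counter

docs = ["love new phone", "hate new phone", "love love service"]

# Vocabulary: every distinct word, sorted so the column order is deterministic.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per tweet, one column per vocabulary word, each cell a word count.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)  # ['hate', 'love', 'new', 'phone', 'service']
for row in matrix:
    print(row)
# [0, 1, 1, 1, 0]
# [1, 0, 1, 1, 0]
# [0, 2, 0, 0, 1]
```

Word order is lost: each tweet is reduced to how many times each vocabulary word appears in it.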
Next, we split the dataset into training and testing sets, and then train the Naive Bayes classifier on the training data.
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the Naive Bayes classifier
model = MultinomialNB()
# Train the model
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
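One optional refinement is to learn the vocabulary from the training split only, so that words appearing only in test tweets do not influence the features. A minimal sketch, assuming scikit-learn's `make_pipeline` and a handful of made-up cleaned tweets standing in for the training data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned tweets and labels, standing in for the training split.
train_texts = [
    "love new phone", "great service love", "happy great day",
    "hate new phone", "terrible service", "awful sad day",
]
train_labels = ["positive", "positive", "positive",
                "negative", "negative", "negative"]

# The pipeline fits the vectorizer on the training texts only,
# then trains the classifier on the resulting counts.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(train_texts, train_labels)

# Unseen text is mapped onto the same vocabulary before prediction.
predictions = pipeline.predict(["love great phone", "terrible awful phone"])
print(list(predictions))  # ['positive', 'negative']
```

With a pipeline, you would split the raw cleaned text with `train_test_split` first and pass the text, rather than a precomputed matrix, to `fit` and `predict`.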
After training the model, it's crucial to evaluate its performance using metrics like accuracy, confusion matrix, and classification report.
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
# Classification report
class_report = classification_report(y_test, y_pred)
print("Classification Report:")
print(class_report)
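To help read these outputs, the sketch below recomputes a confusion matrix, accuracy, and per-class precision and recall by hand. The eight labels are made up for illustration. As in scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels, in sorted label order.

```python
from collections import Counter

# Hypothetical true and predicted labels for eight test tweets.
y_true = ["positive", "positive", "positive", "negative",
          "negative", "neutral", "neutral", "neutral"]
y_hat = ["positive", "positive", "neutral", "negative",
         "positive", "neutral", "neutral", "negative"]

labels = sorted(set(y_true))  # ['negative', 'neutral', 'positive']
pair_counts = Counter(zip(y_true, y_hat))

# Rows are true labels, columns are predicted labels.
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)

def precision_recall(i):
    """Precision and recall for the class at index i.

    Assumes the class was predicted at least once and occurs at least once.
    """
    true_pos = matrix[i][i]
    predicted = sum(row[i] for row in matrix)  # column sum
    actual = sum(matrix[i])                    # row sum
    return true_pos / predicted, true_pos / actual

for row in matrix:
    print(row)
# [1, 0, 1]
# [1, 2, 0]
# [0, 1, 2]
print(accuracy)  # 0.625
precision, recall = precision_recall(labels.index("positive"))
```

Here two of the three tweets predicted as positive are truly positive, and two of the three truly positive tweets were found, so precision and recall for the positive class are both 2/3.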
You can improve the model’s performance by trying different strategies:
Tuning hyperparameters: Adjust the parameters of the Naive Bayes classifier, such as the smoothing parameter alpha of MultinomialNB.
Using TF-IDF: Instead of the Bag of Words model, try Term Frequency-Inverse Document Frequency (TF-IDF), which gives more weight to words that are distinctive to a tweet and less weight to words that appear across most tweets.
Example of implementing TF-IDF:
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=1500)
# Convert the cleaned tweets to TF-IDF features
X_tfidf = tfidf_vectorizer.fit_transform(df['cleaned_tweet']).toarray()
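To show what TF-IDF weighting does to the counts, here is a small hand computation on three made-up cleaned tweets. It follows the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 with L2 normalization, which scikit-learn's documentation describes for its default settings. The `docs` list and the `tfidf` helper are assumptions for illustration.

```python
import math
from collections import Counter

docs = ["love new phone", "hate new phone", "love love service"]
tokenized = [doc.split() for doc in docs]
n_docs = len(docs)

# Document frequency: the number of tweets that contain each word.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Smoothed TF-IDF weights for one tweet, scaled to unit L2 length."""
    tf = Counter(tokens)
    raw = {w: tf[w] * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1) for w in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

weights = tfidf(tokenized[1])  # "hate new phone"
for word, weight in sorted(weights.items()):
    print(word, round(weight, 3))
```

In the second tweet, "hate" appears in only one document, so it receives a larger weight than "new" and "phone", which each appear in two.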
Once you're satisfied with the model, consider deploying it as a web app. You can use frameworks like Flask or Streamlit to create a simple app that accepts tweet inputs and returns sentiment predictions in real-time.
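Whichever framework you choose, the app mostly needs one function that turns a raw tweet into a label. A minimal sketch, where `predict_sentiment` is a hypothetical helper and `clean`, `vectorizer`, and `model` are assumed to be the cleaning function and the fitted objects from the earlier steps:

```python
def predict_sentiment(text, clean, vectorizer, model):
    """Return the predicted label for one raw tweet, or None if nothing is left after cleaning.

    `clean` is any callable mapping raw text to cleaned text; `vectorizer` and
    `model` are assumed to expose fitted `transform` and `predict` methods.
    """
    cleaned = clean(text)
    if not cleaned.strip():
        # Design choice (assumption): skip prediction on empty input.
        return None
    features = vectorizer.transform([cleaned])
    return model.predict(features)[0]
```

Inside a Flask route or a Streamlit callback, you would call something like `predict_sentiment(user_text, clean_tweet, vectorizer, model)` and display the result. The new text must pass through the same cleaning function and the same fitted vectorizer that were used during training.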
Naive Bayes is a powerful yet simple algorithm for sentiment analysis, particularly for beginners. By following this guide, you now have a working sentiment analysis model that can classify tweets into positive, negative, or neutral sentiments. From here, you can enhance the model, apply it to different datasets, or even deploy it for real-world applications.
Keep experimenting with different models, datasets, and techniques to further improve your machine learning skills!