Raw text data, like customer reviews, articles, or social media posts, contains a wealth of information. However, machine learning algorithms primarily operate on numerical data. Directly feeding raw text strings into most standard models won't work. Our task in this section is to transform unstructured text into a structured, numerical format that algorithms can understand and learn from. This process is a specific type of feature engineering often called text vectorization.
We'll focus on two foundational techniques that represent text based on the words it contains: Bag-of-Words and TF-IDF.
The Bag-of-Words model is a simple yet effective way to represent text numerically. Imagine putting all the unique words from your entire collection of documents (the corpus) into a "bag". To represent a specific document, you simply count how many times each word from the bag appears in that document. The order of words and grammar are disregarded, hence the "bag" analogy.
How it Works:
1. Tokenize each document into individual words.
2. Build a vocabulary of every unique word found across the corpus.
3. For each document, count how many times each vocabulary word appears.
4. Represent the document as a vector of these counts, with one position per vocabulary word.
Example:
Consider these two short documents:

Document 1: "the cat sat on the mat"
Document 2: "the dog chased the cat"

Tokenized, they become:

Document 1: ['the', 'cat', 'sat', 'on', 'the', 'mat']
Document 2: ['the', 'dog', 'chased', 'the', 'cat']

The vocabulary of unique words across the corpus is:

['the', 'cat', 'sat', 'on', 'mat', 'dog', 'chased']

(Note: 'the' appears multiple times but is listed once in the vocabulary.)

Counting how often each vocabulary word appears gives one vector per document (counts for 'the', 'cat', 'sat', 'on', 'mat', 'dog', 'chased'):

Document 1: [2, 1, 1, 1, 1, 0, 0]
Document 2: [2, 1, 0, 0, 0, 1, 1]
Implementation: In Python, the CountVectorizer class from the scikit-learn library is commonly used to implement the Bag-of-Words approach.
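Below is a minimal sketch using the two example documents from above. Note that CountVectorizer builds its vocabulary in alphabetical order, so the columns will not match the word order shown earlier, but the counts are the same.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The two example documents from above
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse document-term matrix

# Vocabulary learned from the corpus (alphabetical order)
print(vectorizer.get_feature_names_out())
# ['cat' 'chased' 'dog' 'mat' 'on' 'sat' 'the']

# Raw word counts, one row per document
print(X.toarray())
# [[1 0 0 1 1 1 2]
#  [1 1 1 0 0 0 2]]
```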
Limitations: Because word order and grammar are discarded, BoW cannot distinguish "the dog chased the cat" from "the cat chased the dog". Every word is also treated as equally important, so very common words can dominate the counts, and the vectors become long and sparse as the vocabulary grows.
TF-IDF builds upon the Bag-of-Words concept but adds a weighting scheme to highlight words that are more significant or informative for a specific document within the larger corpus. It down-weights terms that appear frequently across many documents (like common stop words) and increases the weight of terms that are frequent in a specific document but rare overall.
TF-IDF is calculated as the product of two components:
Term Frequency (TF): Measures how frequently a term appears in a specific document. It's often calculated as the raw count of the term in the document or normalized (e.g., term count divided by the total number of terms in the document).
Inverse Document Frequency (IDF): Measures how important a term is across the entire corpus. It's calculated based on the logarithm of the total number of documents divided by the number of documents containing the term.
$$\text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)$$

(Note: Variations exist, often adding 1 to the denominator or numerator to prevent division by zero and smooth the values.)
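As a quick illustration with the two example documents from the Bag-of-Words section, using this basic unsmoothed formula and the natural logarithm: 'the' appears in both of the 2 documents, so IDF('the') = log(2/2) = 0, while 'sat' appears in only one, so IDF('sat') = log(2/1) ≈ 0.69. Terms common to every document are pushed toward zero, while rarer terms receive higher weights.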
The TF-IDF score for a term t in a document d is:
$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

Implementation: The TfidfVectorizer class in scikit-learn calculates TF-IDF scores directly from raw text data. It combines tokenization, counting (TF), and IDF calculation.
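Here is a minimal sketch with the same two example documents. Keep in mind that scikit-learn applies smoothing and length normalization by default, so the numbers differ slightly from the textbook formula above, and the columns are again in alphabetical order.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)  # sparse matrix of TF-IDF weights

print(tfidf.get_feature_names_out())

# Per-term IDF values: terms appearing in every document ('the', 'cat')
# receive the lowest IDF
print(tfidf.idf_)

# Weighted, length-normalized document vectors
print(X.toarray().round(2))
```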
Benefits over BoW: By down-weighting terms that appear across many documents and boosting terms that are frequent in one document but rare overall, TF-IDF highlights the words that actually characterize each document. This usually yields more informative features than raw counts for tasks such as classification.
Limitations: Like Bag-of-Words, TF-IDF still ignores word order and does not capture the meaning of words or the relationships between them, and the resulting vectors remain high-dimensional and sparse.
Before applying BoW or TF-IDF, text data usually requires preprocessing steps to improve the quality of the resulting features. Common steps include lowercasing the text, removing punctuation, removing stop words (very common words such as 'the' or 'and' that carry little information; scikit-learn and NLTK provide standard stop word lists), and optionally reducing words to a base form through stemming or lemmatization. These preprocessing steps are essential for creating meaningful and efficient numerical representations from text.
The vectors generated by CountVectorizer or TfidfVectorizer (often represented as sparse matrices for efficiency) become the input features for your machine learning models. Each column represents a unique word from the vocabulary, and each row represents a document. These features can then be used alongside other numerical or categorical features you might have engineered for tasks like sentiment analysis, topic classification, or spam detection.
While BoW and TF-IDF are powerful foundational techniques, keep in mind that more advanced methods like word embeddings (e.g., Word2Vec, GloVe) capture semantic relationships between words. These are often explored in specialized Natural Language Processing (NLP) contexts, but understanding BoW and TF-IDF provides a solid base for working with text data in many data science applications.