In the previous chapters, we explored methods like Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) to convert text into numerical representations suitable for machine learning algorithms. These techniques rely on word frequencies and document statistics. While useful for tasks where word presence or absence is a strong signal (like basic document categorization), they possess significant limitations when it comes to understanding the deeper meaning and context of language.
The fundamental issue with frequency-based models is their lack of semantic understanding. They operate under the assumption that the meaning of a text can be largely captured by the collection of words it contains, ignoring the relationships between those words.
Consider these points:
Ignoring Word Similarity: BoW and TF-IDF treat every unique word as a distinct, independent feature. This means words with very similar meanings, like "car" and "automobile" or "happy" and "joyful", are represented as completely separate dimensions in the vector space. These models have no inherent way of knowing that such words often signify the same concept; any connection must be learned indirectly by a downstream model, and only if the training data contains enough documents where both words appear in similar contexts. As a result, the vectors for documents containing one word but not the other can end up far apart, even if the documents discuss the same topic.
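To make this concrete, here is a minimal sketch using scikit-learn's CountVectorizer (one common BoW implementation); the sentences are made up for illustration. The "car" and "automobile" dimensions never overlap, so the two documents look less similar than their meaning warrants:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two sentences with the same meaning, differing only in a synonym.
docs = ["the car drove down the street",
        "the automobile drove down the street"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

# "car" and "automobile" occupy separate, unrelated dimensions.
print(vectorizer.get_feature_names_out())

# The shared words keep the similarity from dropping to zero, but the model
# gives no credit at all for "car" and "automobile" meaning the same thing.
print(cosine_similarity(bow[0], bow[1])[0, 0])
```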
Ignoring Word Order and Context: These models discard the sequence of words. A sentence like "dog bites man" is represented identically to "man bites dog" in a simple BoW model, despite the drastically different meanings. While N-grams (introduced in Chapter 2) can capture some local context by treating sequences like "New York" as single units, they don't fully resolve the issue, and they cause a massive increase in the dimensionality of the feature space. More complex syntactic relationships and long-range dependencies remain elusive.
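The same kind of sketch shows the word-order problem directly: both sentences collapse onto an identical vector, so any downstream model sees them as the same input.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["dog bites man", "man bites dog"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs).toarray()

print(vectorizer.get_feature_names_out())  # ['bites' 'dog' 'man']

# Both sentences map to exactly the same vector, so a BoW model
# cannot tell who bit whom.
print(bow[0], bow[1], (bow[0] == bow[1]).all())
```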
Polysemy (Words with Multiple Meanings): Frequency-based methods struggle with words that have multiple meanings (polysemy). For instance, the word "bank" could refer to a financial institution or the side of a river. BoW and TF-IDF typically assign a single representation to "bank", regardless of its intended meaning in a specific context. This conflation muddies the resulting vector representation.
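A short illustrative sketch (again with CountVectorizer on made-up sentences) shows the conflation: both senses of "bank" increment the same feature column, and the representation carries no hint of which meaning is intended.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["she deposited money at the bank",
        "they rested on the bank of the river"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs).toarray()

# Both senses of "bank" land in the same column of the matrix.
bank_col = vectorizer.vocabulary_["bank"]
print("column for 'bank':", bank_col)
print(bow[:, bank_col])  # [1 1]
```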
Sparsity and High Dimensionality: TF-IDF vectors, while often more informative than raw counts, are typically very high-dimensional (one dimension for each unique word in the vocabulary) and sparse (most entries are zero). This high dimensionality can make computation expensive and sometimes hinders the performance of downstream machine learning models, often requiring dimensionality reduction techniques (as discussed in Chapter 2). However, dimensionality reduction on sparse frequency vectors doesn't magically introduce semantic understanding; it merely compresses the existing frequency information.
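To put a number on the sparsity, the sketch below builds a TF-IDF matrix over a small invented corpus and reports the fraction of zero entries; with a realistic vocabulary, that fraction would be far closer to 100%.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny toy corpus; real collections have vocabularies of tens or
# hundreds of thousands of words, so the effect is far more extreme.
docs = [
    "the cat sat on the mat",
    "the dog ran in the park",
    "a cat and a dog played in the park",
    "stock prices rose sharply this quarter",
]

tfidf = TfidfVectorizer().fit_transform(docs)

total_cells = tfidf.shape[0] * tfidf.shape[1]
print("shape:", tfidf.shape)
print(f"zero entries: {1 - tfidf.nnz / total_cells:.0%}")
```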
Let's visualize the sparsity. Imagine a tiny vocabulary and two short documents, say "The cat sat." and "The dog ran."
A simple BoW representation might look like this (ignoring stop words for simplicity):
| Word | Doc 1 | Doc 2 |
|---|---|---|
| cat | 1 | 0 |
| sat | 1 | 0 |
| dog | 0 | 1 |
| ran | 0 | 1 |
Even with this tiny example, half the entries are zero. In real-world scenarios with vocabularies of tens or hundreds of thousands of words, the vectors become overwhelmingly sparse. TF-IDF helps weight the terms but doesn't change this fundamental sparse structure.
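The same toy matrix can be reproduced programmatically. The two documents used below ("the cat sat" and "the dog ran") and the stop-word setting are assumptions chosen to match the table:

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Documents chosen to match the table above (an assumption for illustration).
docs = ["the cat sat", "the dog ran"]

# stop_words='english' drops "the", mirroring the "ignoring stop words" note.
vectorizer = CountVectorizer(stop_words="english")
bow = vectorizer.fit_transform(docs).toarray()

table = pd.DataFrame(bow.T,
                     index=vectorizer.get_feature_names_out(),
                     columns=["Doc 1", "Doc 2"])
print(table)
# Half of the entries are already zero, and every new vocabulary word
# adds another mostly-zero row.
```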
These limitations hinder the performance of NLP systems on tasks requiring nuanced understanding, such as semantic search, machine translation, question answering, and sentiment analysis.
Because frequency-based models treat words as isolated units and fail to capture the rich semantic relationships between them, we need different approaches. This chapter introduces methods built on the idea of distributional semantics, which posits that a word's meaning is related to the contexts in which it frequently appears. This leads us to dense vector representations, known as word embeddings, which overcome many limitations of sparse, frequency-based models.
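As a brief preview of where this leads, the sketch below loads pretrained 50-dimensional GloVe vectors through gensim (assuming gensim is installed and the vectors can be downloaded) and checks that "car" and "automobile", which BoW kept completely separate, now receive measurably similar representations:

```python
# Requires: pip install gensim; the vectors are downloaded on first use.
import gensim.downloader as api

# Load a small set of pretrained 50-dimensional GloVe word vectors.
vectors = api.load("glove-wiki-gigaword-50")

# Dense embeddings place related words near each other, which
# frequency-based vectors cannot do.
print(vectors.similarity("car", "automobile"))  # relatively high
print(vectors.similarity("car", "banana"))      # noticeably lower
print(vectors["car"].shape)                     # (50,) -- dense, low-dimensional
```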