While TF-IDF provides a significant improvement over simple Bag-of-Words by weighting terms based on their importance, both methods share a fundamental limitation: they disregard word order. Consider the phrases "good service" and "service not good". In a standard BoW or TF-IDF representation, these look very similar: both contain "good" and "service", and the only difference is the extra token "not", which, once its position is discarded, is no longer tied to "good" in any way. However, their meanings are opposite. This loss of sequential information can be detrimental for many NLP tasks, such as sentiment analysis or machine translation.
To address this, we can extend our feature representation to include sequences of adjacent words. This is where N-grams come into play.
An N-gram is simply a contiguous sequence of N items from a given sample of text or speech. The "items" are typically words, but can also be characters.
By generating features based not just on single words (unigrams) but also on pairs (bigrams), triplets (trigrams), or longer sequences, we incorporate local word order into our text representation.
How does this help? Let's revisit our example. For "good service", the unigrams are "good" and "service", and the only bigram is "good service". For "service not good", the unigrams are "service", "not", and "good", and the bigrams are "service not" and "not good".
Notice how adding bigrams creates distinct features. The bigram "not good" clearly captures the negative sentiment that would be missed if we only looked at the unigrams "not" and "good" independently. Similarly, bigrams like "New York" or "San Francisco" represent concepts distinct from their constituent words.
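To make this concrete, here is a minimal sketch in plain Python (assuming simple whitespace tokenization) that lists the unigrams and bigrams of both phrases:

def ngrams(tokens, n):
    # Return all contiguous n-grams from a token list as space-joined strings
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

for phrase in ['good service', 'service not good']:
    tokens = phrase.split()  # whitespace tokenization, for illustration only
    print(phrase)
    print('  unigrams:', ngrams(tokens, 1))
    print('  bigrams: ', ngrams(tokens, 2))

Only the second phrase produces the bigram "not good", so the two phrases are no longer indistinguishable once bigram features are added.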
Using N-grams allows the model to learn patterns based on these short phrases, providing a richer understanding of the text beyond individual word counts.
Diagram illustrating the generation of unigrams, bigrams, and trigrams from a sample sentence.
While N-grams enhance context representation, they come at a cost: a significant increase in the number of potential features.
Imagine a vocabulary of 10,000 unique words (unigrams). The number of potential bigrams is 10,000² = 100 million, and the number of potential trigrams is 10,000³ = one trillion. Although many of these potential N-grams will never actually appear in the corpus, the number of observed N-grams can still be vastly larger than the number of unigrams. This leads to much higher-dimensional, sparser feature matrices, greater memory and computation requirements, and an increased risk of overfitting on rare N-grams that occur only once or twice.
Approximate number of unique features generated for different N-gram sizes on a hypothetical corpus with a 10k unigram vocabulary. Note the logarithmic scale on the Y-axis, highlighting the rapid growth. Actual numbers depend heavily on the corpus size and language characteristics.
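For a rough sense of these upper bounds, a quick calculation with the hypothetical 10,000-word vocabulary shows how fast the space of potential N-grams grows:

vocab_size = 10_000
for n in (1, 2, 3):
    # Upper bound: every ordered combination of n vocabulary words is a potential N-gram
    print(f'{n}-grams: up to {vocab_size ** n:,} potential features')

Observed counts are far smaller, since most combinations never occur in real text, but each increase in N still multiplies the size of the potential feature space.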
In practice, we rarely use only higher-order N-grams (like only trigrams). A common strategy is to combine unigrams with bigrams, or perhaps unigrams, bigrams, and trigrams. This provides the benefits of context capture while keeping the unigram information.
The choice of N (the maximum sequence length) and whether to include lower-order grams depends on the specific task and available computational resources. Bigrams often provide a good balance between capturing context and managing feature space size.
You can generate N-gram features using libraries like scikit-learn. For instance, the CountVectorizer or TfidfVectorizer classes accept an ngram_range parameter. Setting ngram_range=(1, 2) instructs the vectorizer to generate both unigrams and bigrams.
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]
# Generate unigrams and bigrams
vectorizer_1_2 = CountVectorizer(ngram_range=(1, 2))
X_1_2 = vectorizer_1_2.fit_transform(corpus)
print("Vocabulary (Unigrams + Bigrams):")
# Show a sample of the combined vocabulary
print(sorted(vectorizer_1_2.vocabulary_.keys())[:15])
print("\nFeature Matrix Shape (documents, features):")
print(X_1_2.shape)
# Generate only bigrams
vectorizer_2_2 = CountVectorizer(ngram_range=(2, 2))
X_2_2 = vectorizer_2_2.fit_transform(corpus)
print("\nVocabulary (Only Bigrams):")
print(sorted(vectorizer_2_2.vocabulary_.keys()))
print("\nFeature Matrix Shape (documents, features):")
print(X_2_2.shape)
This code snippet demonstrates how easily you can incorporate N-grams. Notice how setting ngram_range=(1, 2) results in more features than ngram_range=(2, 2), as it includes both individual words and pairs. The output shape (documents, features) clearly shows the dimensionality increase.
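If the combined vocabulary grows too large, you can prune rare N-grams while still mixing unigrams and bigrams. The sketch below (corpus and parameter values are illustrative only) uses TfidfVectorizer with min_df to keep only N-grams that appear in at least two documents:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'the service was good',
    'the service was not good',
    'good food and good service',
    'the food was not bad',
]

# Unigrams + bigrams, discarding any N-gram that appears in fewer
# than two documents (min_df=2) to limit feature space growth
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X = vectorizer.fit_transform(corpus)

print(sorted(vectorizer.vocabulary_.keys()))
print(X.shape)

Frequency thresholds like min_df (or a max_features cap) are a common way to keep combined unigram-plus-bigram vocabularies manageable on larger corpora.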
It's also worth mentioning character N-grams. Instead of splitting text into words, we can create sequences of N characters. For example, character trigrams for the word "context" would include "con", "ont", "nte", "tex", "ext".
Character N-grams can be particularly useful for handling misspellings and out-of-vocabulary words, for morphologically rich languages where many surface forms share a stem, and for tasks such as language identification. However, they can lead to an even larger feature space than word N-grams, especially for longer character sequences, and individual character N-grams typically carry less semantic meaning than whole words.
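As a brief sketch, scikit-learn's CountVectorizer supports this directly through its analyzer parameter: analyzer='char' treats the text as a raw character stream, while analyzer='char_wb' builds character N-grams only from text inside word boundaries.

from sklearn.feature_extraction.text import CountVectorizer

# Character trigrams for the word "context", matching the example above
char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(3, 3))
X_char = char_vectorizer.fit_transform(['context'])

print(sorted(char_vectorizer.vocabulary_.keys()))
# ['con', 'ext', 'nte', 'ont', 'tex']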
N-grams provide a straightforward and effective way to inject information about local word order into your text features. While they increase dimensionality, the added contextual information often leads to better performance for models built on top of these features, especially when combined with techniques like TF-IDF. As we move forward, we'll see how other methods like word embeddings offer different ways to capture semantic relationships, but N-grams remain a valuable tool in the text feature engineering process.