While TF-IDF provides a significant improvement over simple Bag-of-Words by weighting terms based on their importance, both methods share a fundamental limitation: they disregard word order. Consider the phrases "good service" and "service not good". In a standard BoW or TF-IDF representation, these look very similar: both contain "good" and "service", and the only difference is the extra token "not", which, once its position is discarded, is no longer tied to "good" in any way. However, their meanings are opposite. This loss of sequential information can be detrimental for many NLP tasks, such as sentiment analysis or machine translation.
To address this, we can extend our feature representation to include sequences of adjacent words. This is where N-grams come into play.
An N-gram is simply a contiguous sequence of N items from a given sample of text or speech. The "items" are typically words, but can also be characters.
By generating features based not just on single words (unigrams) but also on pairs (bigrams), triplets (trigrams), or longer sequences, we incorporate local word order into our text representation.
How does this help? Let's revisit our example. For "good service", the unigrams are "good" and "service", and the only bigram is "good service". For "service not good", the unigrams are "service", "not", and "good", and the bigrams are "service not" and "not good".
Notice how adding bigrams creates distinct features. The bigram "not good" clearly captures the negative sentiment that would be missed if we only looked at the unigrams "not" and "good" independently. Similarly, bigrams like "New York" or "San Francisco" represent concepts distinct from their constituent words.
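To make this concrete, here is a minimal sketch in plain Python (assuming simple whitespace tokenization) that lists the unigrams and bigrams of both phrases:

def ngrams(tokens, n):
    # Return all contiguous n-grams from a token list as space-joined strings
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

for phrase in ['good service', 'service not good']:
    tokens = phrase.split()  # whitespace tokenization, for illustration only
    print(phrase)
    print('  unigrams:', ngrams(tokens, 1))
    print('  bigrams: ', ngrams(tokens, 2))

Only the second phrase produces the bigram "not good", so the two phrases are no longer indistinguishable once bigram features are added.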
Using N-grams allows the model to learn patterns based on these short phrases, providing a richer understanding of the text beyond individual word counts.
Diagram illustrating the generation of unigrams, bigrams, and trigrams from a sample sentence.
While N-grams enhance context representation, they come at a cost: a significant increase in the number of potential features.
Imagine a vocabulary of 10,000 unique words (unigrams). The number of potential bigrams is 10,000² = 100 million, and the number of potential trigrams is 10,000³ = one trillion. Although many of these potential N-grams will never actually appear in the corpus, the number of observed N-grams can still be vastly larger than the number of unigrams. This leads to much higher-dimensional, sparser feature matrices, greater memory and computation requirements, and an increased risk of overfitting on rare N-grams that occur only once or twice.
Approximate number of unique features generated for different N-gram sizes on a hypothetical corpus with a 10k unigram vocabulary. Note the logarithmic scale on the Y-axis, highlighting the rapid growth. Actual numbers depend heavily on the corpus size and language characteristics.
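For a rough sense of these upper bounds, a quick calculation with the hypothetical 10,000-word vocabulary shows how fast the space of potential N-grams grows:

vocab_size = 10_000
for n in (1, 2, 3):
    # Upper bound: every ordered combination of n vocabulary words is a potential N-gram
    print(f'{n}-grams: up to {vocab_size ** n:,} potential features')

Observed counts are far smaller, since most combinations never occur in real text, but each increase in N still multiplies the size of the potential feature space.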
In practice, we rarely use only higher-order N-grams (like only trigrams). A common strategy is to combine unigrams with bigrams, or perhaps unigrams, bigrams, and trigrams. This provides the benefits of context capture while keeping the unigram information.
The choice of N (the maximum sequence length) and whether to include lower-order grams depends on the specific task and available computational resources. Bigrams often provide a good balance between capturing context and managing feature space size.
You can generate N-gram features using libraries like scikit-learn. For instance, the CountVectorizer or TfidfVectorizer classes accept an ngram_range parameter. Setting ngram_range=(1, 2) instructs the vectorizer to generate both unigrams and bigrams.
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]
# Generate unigrams and bigrams
vectorizer_1_2 = CountVectorizer(ngram_range=(1, 2))
X_1_2 = vectorizer_1_2.fit_transform(corpus)
print("Vocabulary (Unigrams + Bigrams):")
# Show a sample of the combined vocabulary
print(sorted(vectorizer_1_2.vocabulary_.keys())[:15])
print("\nFeature Matrix Shape (documents, features):")
print(X_1_2.shape)
# Generate only bigrams
vectorizer_2_2 = CountVectorizer(ngram_range=(2, 2))
X_2_2 = vectorizer_2_2.fit_transform(corpus)
print("\nVocabulary (Only Bigrams):")
print(sorted(vectorizer_2_2.vocabulary_.keys()))
print("\nFeature Matrix Shape (documents, features):")
print(X_2_2.shape)
This code snippet demonstrates how easily you can incorporate N-grams. Notice how setting ngram_range=(1, 2) results in more features than ngram_range=(2, 2), as it includes both individual words and pairs. The output shape (documents, features) clearly shows the dimensionality increase.
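If the combined vocabulary grows too large, you can prune rare N-grams while still mixing unigrams and bigrams. The sketch below (corpus and parameter values are illustrative only) uses TfidfVectorizer with min_df to keep only N-grams that appear in at least two documents:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'the service was good',
    'the service was not good',
    'good food and good service',
    'the food was not bad',
]

# Unigrams + bigrams, discarding any N-gram that appears in fewer
# than two documents (min_df=2) to limit feature space growth
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X = vectorizer.fit_transform(corpus)

print(sorted(vectorizer.vocabulary_.keys()))
print(X.shape)

Frequency thresholds like min_df (or a max_features cap) are a common way to keep combined unigram-plus-bigram vocabularies manageable on larger corpora.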
It's also worth mentioning character N-grams. Instead of splitting text into words, we can create sequences of N characters. For example, character trigrams for the word "context" would include "con", "ont", "nte", "tex", "ext".
Character N-grams can be particularly useful for handling misspellings and out-of-vocabulary words, for morphologically rich languages where many surface forms share a stem, and for tasks such as language identification. However, they can lead to an even larger feature space than word N-grams, especially for longer character sequences, and individual character N-grams typically carry less semantic meaning than whole words.
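As a brief sketch, scikit-learn's CountVectorizer supports this directly through its analyzer parameter: analyzer='char' treats the text as a raw character stream, while analyzer='char_wb' builds character N-grams only from text inside word boundaries.

from sklearn.feature_extraction.text import CountVectorizer

# Character trigrams for the word "context", matching the example above
char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(3, 3))
X_char = char_vectorizer.fit_transform(['context'])

print(sorted(char_vectorizer.vocabulary_.keys()))
# ['con', 'ext', 'nte', 'ont', 'tex']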
N-grams provide a straightforward and effective way to inject information about local word order into your text features. While they increase dimensionality, the added contextual information often leads to better performance for models built on top of these features, especially when combined with techniques like TF-IDF. As we move forward, we'll see how other methods like word embeddings offer different ways to capture semantic relationships, but N-grams remain a valuable tool in the text feature engineering process.