Having explored the individual techniques for cleaning and structuring raw text, let's put theory into practice. In this section, we combine the methods discussed earlier (normalization, tokenization, stop word removal, and lemmatization) into a cohesive text preprocessing pipeline. Building such a pipeline is a standard first step in many NLP applications: it ensures that text data is consistent and ready for subsequent analysis or model training.
We'll use Python with NLTK; spaCy offers comparable functionality and appears briefly at the end of this section. Ensure the library is installed and that you have downloaded the necessary resources (tokenizer data, the stop word list, and WordNet data for lemmatization).
# Example setup using NLTK (run these commands in your Python environment if needed)
# import nltk
# nltk.download('punkt') # For tokenization
# nltk.download('wordnet') # For lemmatization
# nltk.download('stopwords') # For stop words
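# Depending on your NLTK version, additional resources may be requested at
# runtime (an assumption; follow the LookupError message if one appears), e.g.:
# nltk.download('punkt_tab')  # newer tokenizer data format
# nltk.download('omw-1.4')    # Open Multilingual WordNet, used alongside WordNet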
Let's start with a few sample sentences that contain typical text variations we need to handle:
sample_texts = [
"The QUICK brown fox Jumped over the LAZY dog!!",
"Preprocessing text data is an important first step.",
"We'll be removing punctuation, numbers (like 123), and applying lemmatization.",
"Stop words such as 'the', 'is', 'and' will be filtered out."
]
Our goal is to transform these raw sentences into lists of cleaned, meaningful tokens.
A preprocessing pipeline typically executes a sequence of operations. The order can matter, so it's important to consider the effect of each step on the ones that follow; the short sketch below shows one such interaction.
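For instance, whether you strip punctuation before or after tokenizing changes how contractions are handled. A minimal sketch (assuming NLTK's punkt tokenizer data is available; the printed tokens are indicative):
import re
from nltk.tokenize import word_tokenize

raw = "We'll be removing punctuation, numbers (like 123), and applying lemmatization."

# Tokenize first: the contraction is kept as two tokens, 'we' and "'ll"
print(word_tokenize(raw.lower())[:3])  # e.g. ['we', "'ll", 'be']

# Remove punctuation first: the apostrophe disappears, leaving a single token 'well'
cleaned = re.sub(r'[^a-z\s]', '', raw.lower())
print(word_tokenize(cleaned)[:3])      # e.g. ['well', 'be', 'removing']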
Converting text to lowercase is a simple yet effective normalization technique that ensures words like "The" and "the" are treated identically.
import re
def normalize_text(text):
    """Converts text to lowercase and keeps only letters and whitespace."""
    text = text.lower()
    # Remove punctuation and numbers using a regular expression
    text = re.sub(r'[^a-z\s]', '', text)
    # Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text
normalized_texts = [normalize_text(text) for text in sample_texts]
print("Normalized Texts:")
for text in normalized_texts:
    print(f"- {text}")
Output:
Normalized Texts:
- the quick brown fox jumped over the lazy dog
- preprocessing text data is an important first step
- well be removing punctuation numbers like and applying lemmatization
- stop words such as the is and will be filtered out
Here, we combined lowercasing with the removal of punctuation and numbers using the regular expression [^a-z\s], which keeps only lowercase letters and whitespace. Note that this also strips apostrophes, which is why "We'll" became "well" in the third sentence.
Next, we split the normalized text into individual words, or tokens, using NLTK's word_tokenize function.
from nltk.tokenize import word_tokenize
tokenized_texts = [word_tokenize(text) for text in normalized_texts]
print("\nTokenized Texts:")
for tokens in tokenized_texts:
    print(f"- {tokens}")
Output:
Tokenized Texts:
- ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
- ['preprocessing', 'text', 'data', 'is', 'an', 'important', 'first', 'step']
- ['well', 'be', 'removing', 'punctuation', 'numbers', 'like', 'and', 'applying', 'lemmatization']
- ['stop', 'words', 'such', 'as', 'the', 'is', 'and', 'will', 'be', 'filtered', 'out']
Now, we filter out common words (stop words) that often add little semantic value for analysis. We'll use NLTK's standard English stop word list and demonstrate adding a custom stop word.
from nltk.corpus import stopwords
# Get standard English stop words
stop_words = set(stopwords.words('english'))
# Example: Add a domain-specific stop word if needed
# stop_words.add('data') # Uncomment to add 'data' as a stop word
def remove_stopwords(tokens):
    """Removes stop words from a list of tokens."""
    return [token for token in tokens if token not in stop_words]
filtered_texts = [remove_stopwords(tokens) for tokens in tokenized_texts]
print("\nTexts after Stop Word Removal:")
for tokens in filtered_texts:
    print(f"- {tokens}")
Output:
Texts after Stop Word Removal:
- ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']
- ['preprocessing', 'text', 'data', 'important', 'first', 'step']
- ['well', 'removing', 'punctuation', 'numbers', 'like', 'applying', 'lemmatization']
- ['stop', 'words', 'filtered']
Notice how words like 'the', 'is', 'an', 'be', 'and', 'will', 'out', 'such', and 'as' have been removed. If we had uncommented stop_words.add('data'), the word 'data' would also have been removed from the second sentence. Customizing the stop list for your specific task or domain is often beneficial, as the short sketch below illustrates.
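For example, a sentiment-oriented task might keep negations such as 'not' while dropping corpus-specific filler. This is an illustrative sketch; the words added and retained here are hypothetical choices, not NLTK defaults:
from nltk.corpus import stopwords

# Start from NLTK's default English stop word list
custom_stop_words = set(stopwords.words('english'))

# Drop a corpus-specific filler word (hypothetical choice for this example)
custom_stop_words.add('data')

# Keep negations, which often carry meaning for sentiment analysis
custom_stop_words.discard('not')

tokens = ['preprocessing', 'text', 'data', 'is', 'not', 'hard']
print([t for t in tokens if t not in custom_stop_words])
# e.g. ['preprocessing', 'text', 'not', 'hard']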
Finally, we reduce words to their base or dictionary form (lemma) using lemmatization. This helps group different inflections of a word together. We'll use NLTK's WordNetLemmatizer.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
def lemmatize_tokens(tokens):
    """Lemmatizes a list of tokens."""
    # Note: lemmatization can be improved by providing part-of-speech tags,
    # but for simplicity we use the default (noun) here.
    return [lemmatizer.lemmatize(token) for token in tokens]
lemmatized_texts = [lemmatize_tokens(tokens) for tokens in filtered_texts]
print("\nLemmatized Texts:")
for tokens in lemmatized_texts:
    print(f"- {tokens}")
Output:
Lemmatized Texts:
- ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']
- ['preprocessing', 'text', 'data', 'important', 'first', 'step']
- ['well', 'removing', 'punctuation', 'number', 'like', 'applying', 'lemmatization']
- ['stop', 'word', 'filtered']
Observe that 'jumped' remained unchanged (the lemmatizer defaults to treating every token as a noun), while 'numbers' became 'number' and 'words' became 'word'. More accurate base forms require part-of-speech (POS) tagging before lemmatizing, so that 'jumped', tagged as a verb, is correctly reduced to 'jump'. A sketch of this follows.
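Here is a minimal sketch of POS-aware lemmatization with NLTK. It assumes the POS tagger data has been downloaded (e.g. nltk.download('averaged_perceptron_tagger')); the mapping helper get_wordnet_pos is our own illustrative function, not part of NLTK.
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS constant (noun by default)."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

def lemmatize_with_pos(tokens):
    """Lemmatizes tokens using their predicted part of speech."""
    return [lemmatizer.lemmatize(token, get_wordnet_pos(tag))
            for token, tag in pos_tag(tokens)]

print(lemmatize_with_pos(['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']))
# e.g. ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']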
Let's consolidate these steps into a single function for easier reuse.
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Initialize resources once
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
def preprocess_text(text):
    """Applies the full preprocessing pipeline to a single text string."""
    # 1. Normalize (lowercase, remove punctuation/numbers)
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    # 2. Tokenize
    tokens = word_tokenize(text)
    # 3. Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # 4. Lemmatize
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    return lemmatized_tokens
# Apply the pipeline to our original samples
processed_texts = [preprocess_text(text) for text in sample_texts]
print("\nFully Processed Texts:")
for tokens in processed_texts:
    print(f"- {tokens}")
Output:
Fully Processed Texts:
- ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']
- ['preprocessing', 'text', 'data', 'important', 'first', 'step']
- ['well', 'removing', 'punctuation', 'number', 'like', 'applying', 'lemmatization']
- ['stop', 'word', 'filtered']
This preprocess_text function now encapsulates our entire pipeline.
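For comparison with spaCy, mentioned at the start of this section, the same steps can be expressed more compactly. This is a rough sketch that assumes the small English model is installed (python -m spacy download en_core_web_sm); spaCy's stop word list and lemmatizer differ slightly from NLTK's, so the output will not match token for token.
import spacy

# Load spaCy's small English model (tokenizer, tagger, and lemmatizer included)
nlp = spacy.load("en_core_web_sm")

def preprocess_text_spacy(text):
    """Lowercase, tokenize, drop stop words and non-alphabetic tokens, lemmatize."""
    doc = nlp(text.lower())
    return [token.lemma_ for token in doc
            if token.is_alpha and not token.is_stop]

print(preprocess_text_spacy("The QUICK brown fox Jumped over the LAZY dog!!"))
# e.g. ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']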
We can represent the sequence of operations using a simple diagram:
Diagram: the sequence of steps in the text preprocessing pipeline (normalize → tokenize → remove stop words → lemmatize).
This practical exercise demonstrates how to combine fundamental text preprocessing techniques into a functional pipeline. The clean, lemmatized tokens produced are now in a much better state for downstream NLP tasks, such as the feature engineering methods we will cover in the next chapter.