As we discussed earlier in the chapter, raw text data is often inconsistent. Variations in capitalization, punctuation, accents, and abbreviations can make it difficult for algorithms to recognize that different strings actually refer to the same concept. Text normalization is a set of techniques used to convert text into a more standard, uniform format, reducing these variations and improving the consistency of your data. This step is fundamental in preparing text for downstream NLP tasks like feature extraction and model training.
Let's examine some common text normalization techniques.
One of the simplest and most common normalization techniques is case folding, which typically involves converting all characters in the text to lowercase.
Why use it? Consider the words "Apple" (the company) and "apple" (the fruit). If your analysis doesn't require distinguishing between proper nouns and common nouns based on capitalization, converting both to "apple" ensures they are treated as the same token. This reduces the overall vocabulary size and simplifies the feature space. For many tasks like sentiment analysis or topic modeling, the distinction might not be necessary.
Example:
text = "The quick brown fox Jumps over the Lazy Dog."
normalized_text = text.lower()
print(normalized_text)
# Output: the quick brown fox jumps over the lazy dog.
Considerations: While generally beneficial, indiscriminate lowercasing can sometimes lead to loss of information.
In some specific applications, like Named Entity Recognition (NER), preserving case might be important. However, for many general NLP tasks, the benefits of vocabulary reduction often outweigh the potential loss of information. You might also consider more sophisticated approaches, like truecasing, which attempts to restore the correct capitalization, but lowercase conversion is the most frequent starting point.
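For instance, if acronyms such as "US" or "NLP" are meaningful for your task, one simple heuristic (a sketch, not a standard library routine) is to lowercase only tokens that are not written entirely in uppercase:
text = "The US team uses NLP to analyze Tweets."
tokens = text.split()
# Lowercase a token only if it is not fully uppercase, preserving acronyms like "US" and "NLP"
selective = " ".join(t if t.isupper() else t.lower() for t in tokens)
print(selective)
# Output: the US team uses NLP to analyze tweets.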
Punctuation marks (like commas, periods, question marks, exclamation points) and other special characters (like #, @, $, %) often add noise without contributing significant semantic meaning, depending on the context.
Strategies: You can remove punctuation entirely, replace it with whitespace so that word boundaries are preserved, or selectively keep characters that carry meaning for your task (for example, hashtags or currency symbols).
Example (Removal using string and re):
import string
import re
text = "Hello, world! This is text_with_punctuation #NLP @example.com $50."
# Using string.punctuation
translator = str.maketrans('', '', string.punctuation)
normalized_text_1 = text.translate(translator)
print(f"Method 1: {normalized_text_1}")
# Output: Method 1: Hello world This is textwithpunctuation NLP examplecom 50
# Using regex: this pattern keeps letters, numbers, and whitespace, removing everything else
normalized_text_2 = re.sub(r'[^\w\s]', '', text)
print(f"Method 2: {normalized_text_2}")
# Output: Method 2: Hello world This is text_with_punctuation NLP examplecom 50
# Note: \w includes the underscore, so "text_with_punctuation" keeps its underscores; refine the pattern if needed.
# A safer way to target exactly the standard punctuation characters:
normalized_text_3 = re.sub('[' + re.escape(string.punctuation) + ']', '', text)
print(f"Method 3: {normalized_text_3}")
# Output: Method 3: Hello world This is textwithpunctuation NLP examplecom 50
Carefully consider the impact of punctuation removal. Removing hyphens can merge compound modifiers (e.g., "state-of-the-art" becomes "stateoftheart"), while removing apostrophes collapses contractions into other words ("it's" becomes "its", colliding with the possessive "its").
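If those merges are a concern, a common alternative is to replace punctuation with a space instead of deleting it and then collapse the extra whitespace. Here is a minimal sketch of that approach:
import re
text = "It's a state-of-the-art model."
# Replace punctuation with a space to keep word boundaries, then collapse repeated spaces
spaced = re.sub(r'[^\w\s]', ' ', text)
spaced = re.sub(r'\s+', ' ', spaced).strip()
print(spaced)
# Output: It s a state of the art model
Note the stray "s" left behind by the apostrophe; this is one reason contraction expansion (covered below) is often applied before punctuation removal.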
Accented characters (like é, ñ, ü) are common in many languages but can also appear in English text (e.g., "résumé", "naïve", "café"). Normalizing these characters involves converting them to their closest ASCII equivalents (e.g., "é" to "e", "ñ" to "n").
Why use it? This ensures that words like "resume" and "résumé" are treated as identical, further standardizing the vocabulary.
Example (using unicodedata):
import unicodedata
def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    # Keep only non-combining characters (effectively removes accents)
    return "".join([c for c in nfkd_form if not unicodedata.combining(c)])
text = "Résumé naïve café façade"
normalized_text = remove_accents(text)
print(normalized_text)
# Output: Resume naive cafe facade
Libraries like unidecode provide another convenient way to perform this transliteration. As with other normalization steps, be mindful of whether the distinction carried by an accent is significant for your specific task, although this is less common in primarily English-language processing.
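For example, assuming the third-party unidecode package is installed, the same transliteration can be done in a single call:
# Requires: pip install unidecode
from unidecode import unidecode
print(unidecode("Résumé naïve café façade"))
# Output: Resume naive cafe facade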
Contractions are shortened forms of words or phrases (e.g., "don't", "I'm", "can't", "it's"). Expanding these back to their original form ("do not", "I am", "cannot", "it is") helps standardize the text.
Why use it? Expansion ensures consistent tokenization and representation. For instance, "don't" might be tokenized differently than "do" and "not". Expanding it ensures the component words are explicitly present.
Example (using a dictionary lookup):
# A small sample mapping
contraction_map = {
"don't": "do not",
"can't": "cannot",
"i'm": "i am",
"it's": "it is", # Note: careful with possessive 'its'
"you're": "you are",
"isn't": "is not"
# ... add more contractions
}
text = "I'm sure it's okay, don't worry."
# Simple regex-based replacement (requires careful pattern design)
import re
# Create a regex pattern for the keys
contraction_pattern = re.compile(r'\b(' + '|'.join(contraction_map.keys()) + r')\b')
def expand_contractions(text, cmap):
    def replace(match):
        return cmap[match.group(0).lower()]  # Use lower() for case-insensitivity
    # Apply the replacement function using the compiled pattern
    # Use lower() on the input text for case-insensitive matching
    return contraction_pattern.sub(replace, text.lower())
normalized_text = expand_contractions(text, contraction_map)
print(normalized_text)
# Output: i am sure it is okay, do not worry.
Building a comprehensive contraction map or using pre-built libraries is common practice. Pay attention to ambiguous cases like "it's" (it is) vs. "its" (possessive), which might require more sophisticated context-aware handling, potentially deferred to later stages like part-of-speech tagging if high precision is needed.
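As a small refinement of the sketch above (not a library API), you can compile the pattern case-insensitively so the map only needs lowercase keys while the rest of the text keeps its original casing:
import re
# Reuses the contraction_map defined above; matching is case-insensitive
ci_pattern = re.compile(r'\b(' + '|'.join(contraction_map.keys()) + r')\b', flags=re.IGNORECASE)
def expand_contractions_preserve_case(text, cmap):
    # Lowercase only the matched contraction for the dictionary lookup
    return ci_pattern.sub(lambda m: cmap[m.group(0).lower()], text)
print(expand_contractions_preserve_case("Don't worry, I'm sure It's fine.", contraction_map))
# Output: do not worry, i am sure it is fine.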
Numbers within text can be treated in several ways, depending on whether the numerical information is relevant to the task:
Removal: Delete numbers entirely if their specific values carry no useful information for the task.
Replacement with a placeholder: Substitute each number with a generic token such as <NUM> or #. This retains the information that a number was present without keeping the specific value, useful if the presence of a number matters but not its magnitude.
Conversion to words: Spell numbers out (e.g., "2" becomes "two") so they are treated like ordinary word tokens.
Considerations: The best approach is task-dependent. For financial sentiment analysis, specific numbers might be very important. For general topic modeling, replacing them with a placeholder might suffice.
Example (Replacement using regex):
import re
text = "Order 123 costs $49.99 for 2 items."
# Remove numbers
removed = re.sub(r'\d+', '', text)
print(f"Removed: {removed}")
# Output: Removed: Order  costs $. for  items.  (note the doubled spaces left behind)
# Replace with placeholder <NUM>
placeholder = re.sub(r'\d+', '<NUM>', text)
print(f"Placeholder: {placeholder}")
# Output: Placeholder: Order <NUM> costs $<NUM>.<NUM> for <NUM> items.
# More specific replacement: handle prices first, then the remaining standalone numbers
placeholder_refined = re.sub(r'\$\d+(\.\d+)?', '<PRICE>', text)  # Prices like $49.99
placeholder_refined = re.sub(r'\b\d+\b', '<NUMBER>', placeholder_refined)  # Standalone numbers
print(f"Refined Placeholder: {placeholder_refined}")
# Output: Refined Placeholder: Order <NUMBER> costs <PRICE> for <NUMBER> items.
# Conversion to words often requires external libraries like 'inflect'
# import inflect
# p = inflect.engine()
# converted = re.sub(r'\b\d+\b', lambda m: p.number_to_words(m.group(0)), text)
# print(f"Converted: {converted}")
# Output: Converted: Order one hundred and twenty-three costs $forty-nine.ninety-nine for two items.
# (Note: prices like $49.99 need dedicated handling to be converted naturally.)
These normalization techniques are rarely used in isolation. They are typically applied as sequential steps within a larger preprocessing pipeline, and the order can matter. For example, lowercasing before expanding contractions means the contraction map only needs lowercase keys rather than every mixed-case variant, while removing punctuation before expanding contractions would break the apostrophe-based patterns used for matching (e.g., "don't" becomes "dont").
A typical sequence might be: lowercasing, then contraction expansion (while apostrophes are still intact), accent removal, number handling, and finally punctuation removal.
However, the optimal sequence and the specific techniques chosen depend heavily on the characteristics of your text data and the goals of your NLP application. Experimentation and evaluation are often necessary to determine the best normalization strategy.
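To illustrate how these steps might fit together, here is a minimal sketch of one such pipeline, reusing the remove_accents, expand_contractions, and contraction_map definitions from the earlier examples; the exact order and patterns are assumptions you should adapt to your own data and task:
import re
def normalize(text):
    text = text.lower()                                 # 1. case folding
    text = expand_contractions(text, contraction_map)   # 2. expand contractions while apostrophes are intact
    text = remove_accents(text)                         # 3. strip accents
    text = re.sub(r'\d+', '<NUM>', text)                # 4. replace numbers with a placeholder
    text = re.sub(r'[^\w\s<>]', ' ', text)              # 5. remove punctuation (keeping the <NUM> brackets)
    return re.sub(r'\s+', ' ', text).strip()            # collapse leftover whitespace
print(normalize("I'm ordering 2 Cafés, don't worry!"))
# Output: i am ordering <NUM> cafes do not worry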
By applying these text normalization techniques, you transform raw, variable text into a standardized format. This cleaned data is much more suitable for the feature engineering methods we will discuss in the next chapter, leading to more effective and reliable NLP models.