As we discussed earlier in the chapter, raw text data is often inconsistent. Variations in capitalization, punctuation, accents, and abbreviations can make it difficult for algorithms to recognize that different strings actually refer to the same concept. Text normalization is a set of techniques used to convert text into a more standard, uniform format, reducing these variations and improving the consistency of your data. This step is fundamental in preparing text for downstream NLP tasks like feature extraction and model training.
Let's examine some common text normalization techniques.
One of the simplest and most common normalization techniques is case folding, which typically involves converting all characters in the text to lowercase.
Why use it? Consider the words "Apple" (the company) and "apple" (the fruit). If your analysis doesn't require distinguishing between proper nouns and common nouns based on capitalization, converting both to "apple" ensures they are treated as the same token. This reduces the overall vocabulary size and simplifies the feature space. For many tasks like sentiment analysis or topic modeling, the distinction might not be necessary.
Example:
text = "The quick brown fox Jumps over the Lazy Dog."
normalized_text = text.lower()
print(normalized_text)
# Output: the quick brown fox jumps over the lazy dog.
Considerations: While generally beneficial, indiscriminate lowercasing can sometimes lead to loss of information.
In some specific applications, like Named Entity Recognition (NER), preserving case might be important. However, for many general NLP tasks, the benefits of vocabulary reduction often outweigh the potential loss of information. You might also consider more sophisticated approaches, like truecasing, which attempts to restore the correct capitalization, but lowercase conversion is the most frequent starting point.
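For instance, if acronyms such as "US" or "NLP" are meaningful for your task, one simple heuristic (a sketch, not a standard library routine) is to lowercase only tokens that are not written entirely in uppercase:
text = "The US team uses NLP to analyze Tweets."
tokens = text.split()
# Lowercase a token only if it is not fully uppercase, preserving acronyms like "US" and "NLP"
selective = " ".join(t if t.isupper() else t.lower() for t in tokens)
print(selective)
# Output: the US team uses NLP to analyze tweets.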
Punctuation marks (like commas, periods, question marks, exclamation points) and other special characters (like #, @, $, %) often add noise without contributing significant semantic meaning, depending on the context.
Strategies: You can remove punctuation entirely, replace it with whitespace so that word boundaries are preserved, or selectively keep characters that carry meaning for your task (for example, hashtags or currency symbols).
Example (Removal using string and re):
import string
import re
text = "Hello, world! This is text_with_punctuation #NLP @example.com $50."
# Using string.punctuation
translator = str.maketrans('', '', string.punctuation)
normalized_text_1 = text.translate(translator)
print(f"Method 1: {normalized_text_1}")
# Output: Method 1: Hello world This is textwithpunctuation NLP examplecom 50
# Using regex: this pattern keeps letters, numbers, and whitespace, removing everything else
normalized_text_2 = re.sub(r'[^\w\s]', '', text)
print(f"Method 2: {normalized_text_2}")
# Output: Method 2: Hello world This is text_with_punctuation NLP examplecom 50
# Note: \w includes the underscore, so "text_with_punctuation" keeps its underscores; refine the pattern if needed.
# A safer way to target exactly the standard punctuation characters:
normalized_text_3 = re.sub('[' + re.escape(string.punctuation) + ']', '', text)
print(f"Method 3: {normalized_text_3}")
# Output: Method 3: Hello world This is textwithpunctuation NLP examplecom 50
Carefully consider the impact of punctuation removal. Removing hyphens can merge compound modifiers (e.g., "state-of-the-art" becomes "stateoftheart"), while removing apostrophes collapses contractions into other words ("it's" becomes "its", colliding with the possessive "its").
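If those merges are a concern, a common alternative is to replace punctuation with a space instead of deleting it and then collapse the extra whitespace. Here is a minimal sketch of that approach:
import re
text = "It's a state-of-the-art model."
# Replace punctuation with a space to keep word boundaries, then collapse repeated spaces
spaced = re.sub(r'[^\w\s]', ' ', text)
spaced = re.sub(r'\s+', ' ', spaced).strip()
print(spaced)
# Output: It s a state of the art model
Note the stray "s" left behind by the apostrophe; this is one reason contraction expansion (covered below) is often applied before punctuation removal.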
Accented characters (like é, ñ, ü) are common in many languages but can also appear in English text (e.g., "résumé", "naïve", "café"). Normalizing these characters involves converting them to their closest ASCII equivalents (e.g., "é" to "e", "ñ" to "n").
Why use it? This ensures that words like "resume" and "résumé" are treated as identical, further standardizing the vocabulary.
Example (using unicodedata):
import unicodedata
def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    # Keep only non-combining characters (effectively removes accents)
    return "".join([c for c in nfkd_form if not unicodedata.combining(c)])
text = "Résumé naïve café façade"
normalized_text = remove_accents(text)
print(normalized_text)
# Output: Resume naive cafe facade
Libraries like unidecode provide another convenient way to perform this transliteration. As with other normalization steps, be mindful of whether the distinction carried by an accent is significant for your specific task, although this is less common in primarily English-language processing.
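For example, assuming the third-party unidecode package is installed, the same transliteration can be done in a single call:
# Requires: pip install unidecode
from unidecode import unidecode
print(unidecode("Résumé naïve café façade"))
# Output: Resume naive cafe facade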
Contractions are shortened forms of words or phrases (e.g., "don't", "I'm", "can't", "it's"). Expanding these back to their original form ("do not", "I am", "cannot", "it is") helps standardize the text.
Why use it? Expansion ensures consistent tokenization and representation. For instance, "don't" might be tokenized differently than "do" and "not". Expanding it ensures the component words are explicitly present.
Example (using a dictionary lookup):
# A small sample mapping
contraction_map = {
"don't": "do not",
"can't": "cannot",
"i'm": "i am",
"it's": "it is", # Note: careful with possessive 'its'
"you're": "you are",
"isn't": "is not"
# ... add more contractions
}
text = "I'm sure it's okay, don't worry."
# Simple regex-based replacement (requires careful pattern design)
import re
# Create a regex pattern for the keys
contraction_pattern = re.compile(r'\b(' + '|'.join(contraction_map.keys()) + r')\b')
def expand_contractions(text, cmap):
    def replace(match):
        return cmap[match.group(0).lower()]  # Use lower() for case-insensitivity
    # Apply the replacement function using the compiled pattern
    # Use lower() on the input text for case-insensitive matching
    return contraction_pattern.sub(replace, text.lower())
normalized_text = expand_contractions(text, contraction_map)
print(normalized_text)
# Output: i am sure it is okay, do not worry.
Building a comprehensive contraction map or using pre-built libraries is common practice. Pay attention to ambiguous cases like "it's" (it is) vs. "its" (possessive), which might require more sophisticated context-aware handling, potentially deferred to later stages like part-of-speech tagging if high precision is needed.
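As a small refinement of the sketch above (not a library API), you can compile the pattern case-insensitively so the map only needs lowercase keys while the rest of the text keeps its original casing:
import re
# Reuses the contraction_map defined above; matching is case-insensitive
ci_pattern = re.compile(r'\b(' + '|'.join(contraction_map.keys()) + r')\b', flags=re.IGNORECASE)
def expand_contractions_preserve_case(text, cmap):
    # Lowercase only the matched contraction for the dictionary lookup
    return ci_pattern.sub(lambda m: cmap[m.group(0).lower()], text)
print(expand_contractions_preserve_case("Don't worry, I'm sure It's fine.", contraction_map))
# Output: do not worry, i am sure it is fine.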
Numbers within text can be treated in several ways, depending on whether the numerical information is relevant to the task:
Removal: Delete numbers entirely if their specific values carry no useful information for the task.
Replacement with a placeholder: Substitute each number with a generic token such as <NUM> or #. This retains the information that a number was present without keeping the specific value, useful if the presence of a number matters but not its magnitude.
Conversion to words: Spell numbers out (e.g., "2" becomes "two") so they are treated like ordinary word tokens.
Considerations: The best approach is task-dependent. For financial sentiment analysis, specific numbers might be very important. For general topic modeling, replacing them with a placeholder might suffice.
Example (Replacement using regex):
import re
text = "Order 123 costs $49.99 for 2 items."
# Remove numbers
removed = re.sub(r'\d+', '', text)
print(f"Removed: {removed}")
# Output: Removed: Order  costs $. for  items.  (note the doubled spaces left behind)
# Replace with placeholder <NUM>
placeholder = re.sub(r'\d+', '<NUM>', text)
print(f"Placeholder: {placeholder}")
# Output: Placeholder: Order <NUM> costs $<NUM>.<NUM> for <NUM> items.
# More specific replacement: handle prices first, then the remaining standalone numbers
placeholder_refined = re.sub(r'\$\d+(\.\d+)?', '<PRICE>', text)  # Prices like $49.99
placeholder_refined = re.sub(r'\b\d+\b', '<NUMBER>', placeholder_refined)  # Standalone numbers
print(f"Refined Placeholder: {placeholder_refined}")
# Output: Refined Placeholder: Order <NUMBER> costs <PRICE> for <NUMBER> items.
# Conversion to words often requires external libraries like 'inflect'
# import inflect
# p = inflect.engine()
# converted = re.sub(r'\b\d+\b', lambda m: p.number_to_words(m.group(0)), text)
# print(f"Converted: {converted}")
# Output: Converted: Order one hundred and twenty-three costs $forty-nine.ninety-nine for two items.
# (Note: prices like $49.99 need dedicated handling to be converted naturally.)
These normalization techniques are rarely used in isolation. They are typically applied as sequential steps within a larger preprocessing pipeline, and the order can matter. For example, lowercasing before expanding contractions means the contraction map only needs lowercase keys rather than every mixed-case variant, while removing punctuation before expanding contractions would break the apostrophe-based patterns used for matching (e.g., "don't" becomes "dont").
A typical sequence might be: lowercasing, then contraction expansion (while apostrophes are still intact), accent removal, number handling, and finally punctuation removal.
However, the optimal sequence and the specific techniques chosen depend heavily on the characteristics of your text data and the goals of your NLP application. Experimentation and evaluation are often necessary to determine the best normalization strategy.
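To illustrate how these steps might fit together, here is a minimal sketch of one such pipeline, reusing the remove_accents, expand_contractions, and contraction_map definitions from the earlier examples; the exact order and patterns are assumptions you should adapt to your own data and task:
import re
def normalize(text):
    text = text.lower()                                 # 1. case folding
    text = expand_contractions(text, contraction_map)   # 2. expand contractions while apostrophes are intact
    text = remove_accents(text)                         # 3. strip accents
    text = re.sub(r'\d+', '<NUM>', text)                # 4. replace numbers with a placeholder
    text = re.sub(r'[^\w\s<>]', ' ', text)              # 5. remove punctuation (keeping the <NUM> brackets)
    return re.sub(r'\s+', ' ', text).strip()            # collapse leftover whitespace
print(normalize("I'm ordering 2 Cafés, don't worry!"))
# Output: i am ordering <NUM> cafes do not worry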
By applying these text normalization techniques, you transform raw, variable text into a standardized format. This cleaned data is much more suitable for the feature engineering methods we will discuss in the next chapter, leading to more effective and reliable NLP models.