After filtering raw text for obvious quality issues, the next step in preparing data for large language models is often text normalization. Normalization aims to standardize text by reducing variations that do not significantly alter meaning but could otherwise lead to data sparsity or inconsistencies for the model. Think of it as smoothing out superficial differences to present the underlying content more uniformly. This process typically occurs before tokenization, as the normalization choices directly influence how the tokenizer breaks down the text and builds its vocabulary.
Applying consistent normalization is important for reducing the complexity the model needs to handle. Without it, variations like "U.S.A.", "USA", and "usa" might be treated as distinct entities, fragmenting the model's understanding. However, normalization is a balancing act; over-aggressive normalization can erase meaningful distinctions.
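As a small illustration of that sparsity problem, the hypothetical helper below (not a library function) lowercases and strips periods so the three variants collapse to a single surface form:
def normalize_mention(text):
    # Hypothetical helper: case folding plus period removal,
    # so "U.S.A.", "USA", and "usa" all map to the same string
    return text.lower().replace(".", "")

variants = ["U.S.A.", "USA", "usa"]
print({normalize_mention(v) for v in variants})
# Output: {'usa'}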
Let's examine some standard text normalization techniques frequently employed in LLM data pipelines.
Converting text to a single case, usually lowercase, is one of the simplest and most common normalization steps. It ensures that surface variants like "The" at the start of a sentence and "the" elsewhere map to the same form, reducing the effective vocabulary size and preventing the model from needing to learn separate representations for capitalized variations used at sentence beginnings or in titles.
text = "The Quick Brown Fox Jumps Over The Lazy Dog."
lower_text = text.lower()
print(f"Original: {text}")
print(f"Lowercase: {lower_text}")
# Output:
# Original: The Quick Brown Fox Jumps Over The Lazy Dog.
# Lowercase: the quick brown fox jumps over the lazy dog.
While generally beneficial, case folding has drawbacks. It collapses potentially useful distinctions, such as differentiating between "Apple" the company and "apple" the fruit, or between acronyms like "IT" (Information Technology) and the word "it". For general-purpose LLMs trained on massive, diverse datasets, the benefits of vocabulary reduction often outweigh the loss of case information, which the model might learn to infer from context anyway. However, for specific domains or tasks, preserving case might be necessary.
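Where case carries signal, a lighter-touch alternative is selective case folding. The sketch below assumes that fully uppercase tokens of two or more letters are acronyms worth preserving; selective_lowercase is an illustrative helper, not a standard routine:
def selective_lowercase(text):
    # Keep tokens that look like acronyms (all uppercase, length >= 2),
    # lowercase everything else
    tokens = text.split()
    normalized = [
        tok if tok.isupper() and len(tok) >= 2 else tok.lower()
        for tok in tokens
    ]
    return " ".join(normalized)

print(selective_lowercase("The IT department at Apple uses NASA data."))
# Output: the IT department at apple uses NASA data.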
Text data scraped from various sources often contains characters represented in multiple ways in Unicode. For instance, an accented character like 'é' can be represented as a single precomposed character (U+00E9) or as a base character 'e' followed by a combining acute accent (U+0065 U+0301). While visually identical, these different byte sequences can be treated as distinct by downstream processes if not normalized.
The unicodedata module in Python provides the four standard Unicode normalization forms:
- NFC: canonical decomposition followed by canonical composition, yielding precomposed characters where possible.
- NFD: canonical decomposition, leaving base characters followed by combining marks.
- NFKC: like NFC, but also applies compatibility mappings (folding ligatures, superscripts, and similar variants into plain characters).
- NFKD: like NFD, with the same compatibility mappings.
For LLM training, NFC is a safe baseline, ensuring canonical representation without altering content significantly. NFKC is sometimes used for stronger normalization, potentially simplifying the vocabulary further by collapsing visual variants, but requires caution as it can alter semantics in rare cases.
import unicodedata
# Example: Precomposed vs. Decomposed 'é'
char_nfc = 'é' # U+00E9
char_nfd = 'e\u0301' # U+0065 U+0301
print(f"NFC string: '{char_nfc}', Length: {len(char_nfc)}")
print(f"NFD string: '{char_nfd}', Length: {len(char_nfd)}")
print(f"Are they equal? {char_nfc == char_nfd}") # False
# Normalizing both to NFC
normalized_nfc_1 = unicodedata.normalize('NFC', char_nfc)
normalized_nfc_2 = unicodedata.normalize('NFC', char_nfd)
print(f"Normalized NFC 1: '{normalized_nfc_1}', Length: {len(normalized_nfc_1)}")
print(f"Normalized NFC 2: '{normalized_nfc_2}', Length: {len(normalized_nfc_2)}")
print(f"Are normalized forms equal? {normalized_nfc_1 == normalized_nfc_2}") # True
# Example with NFKC collapsing a ligature
ligature = '\ufb01' # LATIN SMALL LIGATURE FI (U+FB01)
normalized_nfkc = unicodedata.normalize('NFKC', ligature)
print(f"Original ligature: '{ligature}'")
print(f"NFKC normalized: '{normalized_nfkc}'") # Output: 'fi'
Choosing the right Unicode normalization form depends on the nature of the data and the desired level of standardization. NFC is generally recommended as a starting point.
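As a concrete illustration of the caution above, NFKC folds compatibility characters such as superscripts into their plain counterparts, which can subtly change what a string says:
import unicodedata

expr = "E = mc\u00b2" # contains SUPERSCRIPT TWO (U+00B2)
print(unicodedata.normalize('NFC', expr))  # E = mc² (unchanged)
print(unicodedata.normalize('NFKC', expr)) # E = mc2 (superscript folded to a plain digit)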
Sometimes, accents or other diacritical marks are removed to further simplify the text, mapping characters like 'é', 'ê', 'è' all to 'e'. This is a more aggressive normalization step than case folding or standard Unicode normalization.
import unicodedata
def remove_accents(input_str):
    # Normalize to NFD so each accented character becomes a base
    # character followed by its combining marks
    nfd_form = unicodedata.normalize('NFD', input_str)
    # Filter out the combining marks (Unicode category 'Mn', nonspacing marks)
    return "".join(
        c for c in nfd_form if unicodedata.category(c) != 'Mn'
    )
text_with_accents = "El niño juega al fútbol en el café."
text_without_accents = remove_accents(text_with_accents)
print(f"Original: {text_with_accents}")
print(f"Accents removed: {text_without_accents}")
# Output:
# Original: El niño juega al fútbol en el café.
# Accents removed: El nino juega al futbol en el cafe.
While this simplifies text, it can be problematic for languages where accents are phonemically significant (distinguish meaning), such as French ('pêche' - peach vs. 'péché' - sin) or Spanish ('año' - year vs. 'ano' - anus). For multilingual models or datasets containing such languages, removing accents is generally discouraged as it discards important linguistic information. Its use might be justified only if the target application specifically requires accent-insensitive matching or if the source data quality is extremely poor regarding accents.
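Reusing the remove_accents function defined above on the Spanish example makes the risk concrete: two different words collapse into the same string.
print(remove_accents("año"), remove_accents("ano"))
# Output: ano ano
print(remove_accents("año") == remove_accents("ano")) # True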
Inconsistent whitespace (extra spaces, tabs, newlines) is common in raw text. Normalizing whitespace typically involves:
- Stripping leading and trailing whitespace.
- Collapsing runs of whitespace characters (spaces, tabs, newlines) into a single space.
- Standardizing line endings (converting \r\n or \r to \n).
import re
text_with_bad_whitespace = " This text \t has extra \n\n whitespace. "
# Strip leading/trailing whitespace
stripped_text = text_with_bad_whitespace.strip()
# Replace multiple whitespace characters with a single space
normalized_whitespace_text = re.sub(r'\s+', ' ', stripped_text)
print(f"Original: '{text_with_bad_whitespace}'")
print(f"Normalized: '{normalized_whitespace_text}'")
# Output:
# Original: ' This text has extra
#
# whitespace. '
# Normalized: 'This text has extra whitespace.'
This step ensures consistent spacing, which aids tokenization and prevents the model from learning spurious patterns based on arbitrary whitespace variations.
Applying these text normalization methods thoughtfully helps create cleaner, more consistent datasets, which in turn contributes to more stable training and potentially better performance for large language models. The key is consistency and making informed choices about the trade-offs involved.
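As a closing sketch, the individual steps can be combined into a single helper. The normalize_text function below and its particular ordering (Unicode NFC, optional case folding, then whitespace normalization) are illustrative choices rather than a fixed recipe; accent stripping is left out because of the trade-offs discussed above.
import re
import unicodedata

def normalize_text(text, lowercase=True):
    # Canonical Unicode representation first
    text = unicodedata.normalize('NFC', text)
    # Optional case folding
    if lowercase:
        text = text.lower()
    # Collapse whitespace runs and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text

raw = "  The  Caf\u00e9   serves\tGREAT coffee.\r\n"
print(f"'{normalize_text(raw)}'")
# Output: 'the café serves great coffee.'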