As introduced earlier, raw text is seldom ready for direct input into machine learning algorithms. It is often unstructured and inconsistent, and it contains elements irrelevant to the underlying meaning or task. To bridge this gap, we employ a sequence of standardized steps known as the Natural Language Processing (NLP) pipeline. Think of it as an assembly line for text, where raw material enters at one end, undergoes various transformations, and emerges as structured, numerical data suitable for analysis or model training.
While the specific components can vary depending on the application and the nature of the text, a typical pipeline includes several core stages focused on cleaning, structuring, and representing the text.
A common sequence of steps in an NLP pipeline, transforming raw text into features suitable for modeling.
Let's briefly examine the purpose of these initial processing stages, which are the focus of this chapter; short code sketches after this list illustrate them in practice:
Text Cleaning & Normalization: This initial phase deals with removing or altering elements that add noise without contributing significant meaning. This can include removing HTML tags, eliminating special characters or punctuation (or handling them selectively), converting text to a consistent case (e.g., lowercase), and expanding contractions. The goal is to standardize the text and reduce superficial variations.
Tokenization: Text is broken down into smaller units, called tokens. Usually, tokens correspond to words, but they can also be characters, subwords (parts of words), or even sentences, depending on the task and the chosen method. This step is fundamental as most subsequent processing operates on these tokens. We'll cover different tokenization strategies, including subword methods like BPE, later in this chapter.
Stop Word Removal: Common words like "the", "a", "is", "in", which appear frequently but often carry little specific meaning for tasks like classification or topic modeling, are filtered out. While seemingly simple, deciding which words are "stop words" can be domain-specific, requiring customization beyond standard lists.
Stemming & Lemmatization: These techniques aim to reduce words to their base or root form. For example, "running", "runs", and "ran" might all be reduced to "run". Stemming typically involves chopping off word endings (suffixes) using heuristic rules, which is fast but can sometimes result in non-dictionary words. Lemmatization uses vocabulary and morphological analysis to return the actual dictionary form (lemma) of a word, which is usually more accurate but computationally more intensive. The choice between them depends on the specific requirements of the downstream task.
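To make cleaning, tokenization, and stop word removal concrete, here is a minimal sketch using only Python's standard library. The regular expressions, the tiny stop-word set, and the helper names (`clean_text`, `tokenize`, `remove_stop_words`) are illustrative assumptions rather than a prescribed implementation; contraction expansion and language-aware tokenization are deliberately omitted.

```python
import re

# A tiny illustrative stop-word set; real projects typically start from a
# library list (e.g. NLTK's English stop words) and customize it per domain.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "and", "to", "on"}

def clean_text(text: str) -> str:
    """Strip HTML tags, lowercase, and drop punctuation/special characters."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    text = text.lower()                        # normalize case
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # keep only letters, digits, spaces
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

def tokenize(text: str) -> list[str]:
    """Whitespace tokenization: adequate for a sketch, crude for real text."""
    return text.split()

def remove_stop_words(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in STOP_WORDS]

raw = "<p>The model IS running quickly in the lab!</p>"
tokens = remove_stop_words(tokenize(clean_text(raw)))
print(tokens)  # ['model', 'running', 'quickly', 'lab']
```

The later sections on noise handling and stop word customization revisit the simplifications made here.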
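The stemming versus lemmatization trade-off is easiest to see side by side. The sketch below compares NLTK's PorterStemmer with its WordNetLemmatizer; the word list is made up for illustration, and the fixed verb part-of-speech hint is a simplifying assumption, since a real pipeline would supply POS tags automatically.

```python
# Requires NLTK (pip install nltk) plus a one-time download of the WordNet data.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "runs", "ran", "studies"]:
    stem = stemmer.stem(word)
    # pos="v" treats every word as a verb, an assumption made here for
    # simplicity; normally a POS tagger provides this hint per token.
    lemma = lemmatizer.lemmatize(word, pos="v")
    print(f"{word:10s} stem={stem:8s} lemma={lemma}")
```

The heuristic stemmer maps "studies" to the non-word "studi" and leaves the irregular form "ran" unchanged, while the lemmatizer returns the dictionary forms "study" and "run", illustrating the accuracy-versus-cost trade-off described above.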
It's important to understand that this pipeline is a conceptual framework, not a rigid sequence that must be followed identically for every project.
Following these preprocessing steps, the cleaned and structured tokens are typically converted into numerical representations through Feature Extraction methods (like Bag-of-Words, TF-IDF, or word embeddings, covered in Chapters 2 and 4). These numerical features then serve as the input for Modeling or Analysis, where machine learning algorithms are trained for tasks like classification (Chapter 3), or sequence models are applied (Chapter 5).
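As a brief preview of that feature extraction step, the sketch below applies scikit-learn's TfidfVectorizer to a tiny made-up corpus. The corpus is purely illustrative, the `get_feature_names_out` call assumes a reasonably recent scikit-learn release, and the mechanics of TF-IDF itself are left to Chapter 2.

```python
# Assumes scikit-learn is installed (pip install scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "dogs and cats can be pets",
]

vectorizer = TfidfVectorizer()        # learns the vocabulary and IDF weights
X = vectorizer.fit_transform(corpus)  # sparse matrix of TF-IDF scores

print(vectorizer.get_feature_names_out())  # the learned vocabulary terms
print(X.shape)                             # (number of documents, vocabulary size)
```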
This chapter focuses squarely on the initial, foundational preprocessing stages. In the following sections, we will examine advanced tokenization, compare stemming and lemmatization in detail, discuss strategies for noise handling and stop word customization, and implement these techniques in practical preprocessing pipelines.