Before AI models can begin to learn from different types of data, the raw data itself needs to be cleaned and structured. Think of raw data as ingredients straight from the garden; they need some washing and chopping before they can be used in a recipe. This preparation phase is known as preprocessing. Its goal is to transform the data into a consistent and usable format, making it easier for AI algorithms to process and learn effectively. The specific steps can vary depending on the data type and the problem you're trying to solve, but some fundamental techniques are widely used.
Let's look at common preprocessing steps for text, image, and audio data.
Text data, like articles, comments, or messages, is unstructured and needs quite a bit of tidying up. Here are some typical steps:
Lowercasing: Convert all text to a single case, usually lowercase. This ensures that words like "Apple" (the company) and "apple" (the fruit, or the company at the start of a sentence) are treated as the same word if their casing isn't semantically important for the task. For example, "The Cat sat." becomes "the cat sat.".
Punctuation Removal: Remove punctuation marks such as periods, commas, exclamation marks, and question marks (e.g., ".", ",", "!", "?"). While punctuation provides grammatical structure for humans, it can sometimes be noise for simpler AI models. So, "the cat sat." might become "the cat sat".
Tokenization: Break down the text into individual units, called tokens. Most commonly, tokens are words, but they can also be characters or sub-word units. For instance, "the cat sat" is tokenized into ["the", "cat", "sat"].
Stop Word Removal: Eliminate common words that appear frequently but often don't carry much specific meaning for many tasks. Examples include "the", "a", "an", "is", "in", "on". After removing stop words from ["the", "cat", "sat", "on", "the", "mat"], you might get ["cat", "sat", "mat"]. The exact list of stop words can be customized.
Stemming and Lemmatization: Reduce words to their base or root form. For example, "running", "runs", and "ran" can all be mapped to "run". Stemming does this by crudely chopping off word endings, while lemmatization uses vocabulary and grammatical rules to return a proper dictionary form (the lemma), so it is slower but usually more accurate.
These text preprocessing steps help standardize the text data, reduce the vocabulary size, and make patterns more apparent for AI models. The following diagram illustrates a typical sequence of text preprocessing operations:
A sequence of common text preprocessing steps, transforming raw text into a more structured format of tokens.
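To make these steps concrete, here is a minimal sketch in Python using only the standard library. The small stop-word list and the naive suffix-stripping "stemmer" are illustrative simplifications of our own; real projects typically rely on libraries such as NLTK or spaCy for tokenization, stop words, and stemming/lemmatization.

```python
import string

# Illustrative subset; real stop-word lists are much longer
STOP_WORDS = {"the", "a", "an", "is", "in", "on"}

def preprocess_text(text):
    # 1. Lowercasing
    text = text.lower()
    # 2. Punctuation removal
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. Tokenization (simple whitespace split)
    tokens = text.split()
    # 4. Stop word removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 5. Very naive "stemming": strip a common suffix (real stemmers are far more careful)
    tokens = [t[:-3] if t.endswith("ing") and len(t) > 5 else t for t in tokens]
    return tokens

print(preprocess_text("The Cat sat on the mat."))  # ['cat', 'sat', 'mat']
```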
Image data, which you learned is represented as a grid of pixel values, also requires preparation:
Resizing or Scaling: Images come in various dimensions. Most AI models, especially neural networks, expect inputs of a fixed size. Therefore, images are typically resized to a standard height and width (e.g., 224x224 pixels or 256x256 pixels). This might involve scaling the image up or down, and sometimes cropping.
Normalization: Pixel values are often in the range of 0 to 255. Normalization scales these values to a smaller, standard range, such as 0 to 1 or -1 to 1. A common way to normalize to the [0, 1] range is by dividing each pixel value I(x,y) by 255:
$$ I_{\text{normalized}}(x, y) = \frac{I(x, y)}{255} $$

Normalization helps stabilize and speed up the training process for many AI models.
Grayscaling: For some tasks, color information might not be necessary or could even be a distraction. In such cases, color images can be converted to grayscale. This reduces the amount of data per image (from three color channels, typically Red, Green, Blue, to one intensity channel), simplifying the model's input.
Data Augmentation (A Glimpse): While sometimes considered a separate step, data augmentation techniques are often applied during preprocessing. These involve creating modified versions of existing images by applying random transformations like rotations, flips, zooms, or brightness adjustments. This artificially increases the size and diversity of the training dataset, which can help the model generalize better to new, unseen images. We won't go deep into this here, but it's a very common practice.
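The sketch below shows resizing, optional grayscaling, and normalization using Pillow and NumPy. The 224x224 target size is just an example, and the file path in the usage comment is a placeholder.

```python
import numpy as np
from PIL import Image

def preprocess_image(path, size=(224, 224), grayscale=False):
    img = Image.open(path)
    # Resize to the fixed height and width the model expects
    img = img.resize(size)
    # Optionally drop color information (3 channels -> 1 intensity channel)
    if grayscale:
        img = img.convert("L")
    # Convert to an array of pixel values in [0, 255]...
    pixels = np.asarray(img, dtype=np.float32)
    # ...and normalize to the [0, 1] range
    return pixels / 255.0

# Hypothetical usage:
# x = preprocess_image("photo.jpg")  # shape (224, 224, 3), values in [0, 1]
```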
Audio signals, represented as waveforms or derived features, also benefit from preprocessing:
Resampling: Audio can be recorded at different sampling rates (the number of samples of audio carried per second, measured in Hz or kHz). For consistency, all audio files in a dataset are typically resampled to a uniform sampling rate, for example, 16 kHz (common for speech) or 44.1 kHz (CD quality).
Normalization (Amplitude Scaling): Audio signals can have varying loudness levels. Normalization adjusts the amplitude of the audio waveform to a standard range, for example, between -1 and 1. This prevents segments with very high or very low amplitudes from disproportionately influencing the model.
Noise Reduction: Recordings often contain background noise. Various techniques, from simple filters to more advanced algorithms, can be applied to reduce or remove this noise, making the underlying signal (like speech or music) clearer.
Framing and Windowing: Audio signals are typically non-stationary, meaning their statistical properties change over time. To handle this, long audio signals are usually segmented into short, often overlapping, frames (e.g., 20-30 milliseconds long). A windowing function (like a Hamming window) is applied to each frame to smooth the edges and reduce spectral leakage when analyzing frequencies.
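Here is a minimal NumPy sketch of amplitude normalization and framing with a Hamming window. The 16 kHz sampling rate, 25 ms frame length, and 10 ms hop are common but arbitrary choices for this example; resampling and noise reduction are left to dedicated libraries such as librosa or scipy.

```python
import numpy as np

def normalize_amplitude(signal):
    # Scale the waveform so its peak magnitude is 1 (guard against silent input)
    peak = np.max(np.abs(signal))
    return signal / peak if peak > 0 else signal

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)  # e.g., 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)      # e.g., 160 samples -> overlapping frames
    window = np.hamming(frame_len)                  # smooths frame edges, reduces spectral leakage
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frames.append(signal[start:start + frame_len] * window)
    return np.array(frames)

# Hypothetical usage with one second of random "audio":
audio = normalize_amplitude(np.random.randn(16000))
frames = frame_signal(audio)  # shape: (num_frames, 400)
```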
As you know, video data combines sequences of images (frames) with an audio track. Preprocessing video data generally involves applying the image preprocessing techniques described above to each frame (e.g., resizing, normalization) and the audio preprocessing techniques to the accompanying audio stream.
Beyond these, video-specific preprocessing might include frame sampling (keeping only every Nth frame to reduce the amount of data), standardizing the frame rate across clips, and keeping the audio track synchronized with the retained frames, as sketched below.
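As a sketch of this per-frame approach, the snippet below uses OpenCV to read a video, keep every Nth frame, and apply the image steps (resizing and normalization) to each kept frame. The filename and sampling interval are placeholders of our own choosing.

```python
import cv2
import numpy as np

def preprocess_video(path, size=(224, 224), every_nth=5):
    cap = cv2.VideoCapture(path)
    frames = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of the video
            break
        if index % every_nth == 0:
            frame = cv2.resize(frame, size)                   # same resizing as for single images
            frames.append(frame.astype(np.float32) / 255.0)   # normalize pixels to [0, 1]
        index += 1
    cap.release()
    return np.array(frames)  # shape: (num_kept_frames, 224, 224, 3)

# Hypothetical usage:
# clip = preprocess_video("clip.mp4")
```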
The specific preprocessing steps for any modality are chosen based on the requirements of the AI model and the nature of the task. These initial preparations are fundamental for ensuring that the data fed into AI systems is clean, consistent, and in a format conducive to learning. Once individual data modalities are properly preprocessed, we can then begin to explore how to combine them effectively in multimodal systems.