In this chapter, we've discussed how different types of data, like text, images, and audio, are structured for AI systems. Now, let's move from descriptions to actual examples. This practical look will help solidify your understanding of what these data formats look like when prepared for a machine. Think of this as opening the hood to see the engine's parts after reading the manual.
Text is everywhere, from simple messages to lengthy documents. But how does an AI model "read" text? It starts by breaking it down into manageable pieces and then converting those pieces into numbers.
Raw Text: This is text as we see it. Let's take a simple sentence:
AI processes data.
Tokenization: The first step is usually to break the sentence into individual units called tokens. These tokens are often words or punctuation marks.
Our sentence "AI processes data." would be tokenized into a list like this:
["AI", "processes", "data", "."]
Numerical Representation: AI models work with numbers, not text directly. So, these tokens are converted into numerical form. A straightforward way is to assign a unique ID to each word in a predefined vocabulary.
Imagine a very small vocabulary for our example:
"AI": 0
"processes": 1
"data": 2
"." (period): 3

Using this vocabulary, our tokenized sentence ["AI", "processes", "data", "."]
becomes a sequence of numbers:
[0, 1, 2, 3]
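A minimal sketch of this lookup in Python, assuming the tiny four-entry vocabulary above:

vocab = {"AI": 0, "processes": 1, "data": 2, ".": 3}

tokens = ["AI", "processes", "data", "."]
token_ids = [vocab[token] for token in tokens]
print(token_ids)  # [0, 1, 2, 3]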
Real AI systems use much larger vocabularies, sometimes containing tens of thousands or even millions of words. Each word gets a unique numerical ID. More advanced techniques create dense vector representations (embeddings) for words, capturing semantic meaning, but the core idea starts with this kind of numerical mapping.
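As an illustrative sketch (not a trained model), an embedding can be pictured as a table with one row of numbers per vocabulary ID; looking up a token's ID returns its vector:

import numpy as np

vocab_size, embedding_dim = 4, 5
# Random values stand in for learned embeddings in this toy example.
embedding_table = np.random.rand(vocab_size, embedding_dim)

token_ids = [0, 1, 2, 3]
vectors = embedding_table[token_ids]  # shape: (4, 5), one vector per token
print(vectors.shape)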
As mentioned earlier, an image can be thought of as a grid of pixel values. Each pixel has a value (or set of values) representing its color and intensity.
Let's consider a very small grayscale image, say 3 pixels wide and 3 pixels tall. Its representation would be a 2D array (or matrix) of numbers. Each number typically ranges from 0 (black) to 255 (white), indicating the intensity of the grayscale pixel.
For example, such a 3x3 grayscale image might be represented as:
Image Matrix (Grayscale) =
210  150   80
180  120   50
 90   60   20

A 3x3 grayscale image represented as a matrix of pixel intensity values. Higher values correspond to brighter pixels (e.g., 210 is brighter than 20).
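In Python, such an image is typically stored as a 2D NumPy array; this sketch uses the same values as the matrix above:

import numpy as np

gray_image = np.array([
    [210, 150,  80],
    [180, 120,  50],
    [ 90,  60,  20],
], dtype=np.uint8)

print(gray_image.shape)  # (3, 3)
print(gray_image[0, 0])  # 210, the top-left (brightest) pixel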
For color images, the representation is similar but a bit more complex. A common way is to use the RGB color model, where each pixel is represented by three values: one for Red, one for Green, and one for Blue intensity. So, a color image is like having three such matrices, one for each color channel. For an image of size W×H (width W pixels, height H pixels), a grayscale image is a W×H matrix, and an RGB color image is a W×H×3 tensor (a 3D array).
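A small sketch of the shapes involved, assuming a 3x3 color image filled with placeholder values:

import numpy as np

height, width = 3, 3
# Three channels (R, G, B), each holding an intensity from 0 to 255.
rgb_image = np.zeros((height, width, 3), dtype=np.uint8)
rgb_image[0, 0] = [255, 0, 0]  # set the top-left pixel to pure red

print(rgb_image.shape)  # (3, 3, 3): height x width x channels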
Audio, like speech or music, starts as a continuous sound wave. To process it digitally, this wave is sampled at regular intervals. At each interval, the amplitude (loudness) of the wave is measured. This results in a sequence of numbers representing the audio.
Imagine a tiny snippet of sound. Its digital representation might look like this sequence of amplitude values:
[0.0, 0.25, 0.7, 0.3, -0.2, -0.6, -0.1, 0.15]
Each number is an amplitude measurement at a specific point in time. We can visualize this:
A simplified digital representation of a sound wave, showing amplitude values at discrete time steps.
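In code, a rough sketch of this sampled waveform (assuming, purely for illustration, a sample rate of 8,000 samples per second) shows how each amplitude value maps to a point in time:

import numpy as np

# Amplitude values sampled from the waveform (the sequence shown above).
waveform = np.array([0.0, 0.25, 0.7, 0.3, -0.2, -0.6, -0.1, 0.15])

sample_rate = 8000  # assumed: 8,000 samples per second
times = np.arange(len(waveform)) / sample_rate  # time in seconds of each sample

for t, amp in zip(times, waveform):
    print(f"t = {t:.6f} s, amplitude = {amp:+.2f}")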
These raw waveform values can be used directly, or they can be transformed into more complex representations like spectrograms, which show the spectrum of frequencies present in the audio over time. A spectrogram itself is often represented as a 2D array, much like a grayscale image.
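As a hedged sketch, SciPy's signal module can compute a spectrogram from a waveform; the exact parameters (window size, overlap) vary by application, and the synthetic sine wave below simply stands in for a real recording:

import numpy as np
from scipy import signal

sample_rate = 8000
t = np.arange(0, 1.0, 1 / sample_rate)
# A 440 Hz sine wave stands in for a real recording in this example.
waveform = np.sin(2 * np.pi * 440 * t)

frequencies, times, spec = signal.spectrogram(waveform, fs=sample_rate)
print(spec.shape)  # 2D array: frequency bins x time frames, much like a grayscale image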
Video data combines moving images with sound. Essentially, a video is a sequence of image frames displayed rapidly one after another to create the illusion of motion, accompanied by an audio track that is synchronized with these frames.
So, for an AI system, video data breaks down into two main components: a sequence of image frames (the visual stream) and a synchronized audio track (the sound stream).
Here's a simple diagram illustrating this structure:
Structure of video data, illustrating its composition from a sequence of image frames and an associated audio track.
Processing video data often involves handling these two streams, sometimes separately at first, and then considering their relationships (e.g., how the audio aligns with the visual events in the frames).
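A sketch of how these two streams might look as arrays, with made-up sizes chosen only for illustration:

import numpy as np

num_frames, height, width = 30, 64, 64  # a one-second clip at 30 frames per second
audio_sample_rate = 16000               # assumed audio sample rate

# Visual stream: a sequence of RGB frames -> 4D array (frames x height x width x channels).
frames = np.zeros((num_frames, height, width, 3), dtype=np.uint8)

# Audio stream: one second of amplitude samples -> 1D array.
audio = np.zeros(audio_sample_rate, dtype=np.float32)

print(frames.shape)  # (30, 64, 64, 3)
print(audio.shape)   # (16000,)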
In this hands-on look, we've seen simplified examples of how text, images, audio, and video are transformed into formats that AI models can understand and process.
These numerical representations are the fundamental building blocks that AI models use for learning and making predictions. The preprocessing steps discussed earlier in the chapter help clean and standardize these formats. Understanding these basic representations is an important first step before we explore how AI models integrate information from these diverse sources.