When we look at a photograph or a frame from a video, our brains effortlessly process the scene, identifying objects, colors, and their relationships. For an AI system to do something similar, it first needs a way to "see" and interpret this visual information. This involves converting images into a language machines understand: numbers. As the chapter introduction mentioned, one way to think about an image is as a grid of pixel values, where each pixel at coordinates (x,y) has an intensity I(x,y). Let's explore this and other ways images are represented for AI.
At its most fundamental level, a digital image is a collection of pixels, short for "picture elements." Each pixel is the smallest unit of an image and holds information about the color or intensity at a specific point.
The simplest type of image is a grayscale image, often called a black-and-white image. In a grayscale image, each pixel is represented by a single number. This number typically indicates the intensity of light, ranging from 0 (black) to 255 (white), with various shades of gray in between. So, an image with width W and height H can be imagined as a 2D array or matrix of numbers, where I(x,y) is the intensity of the pixel at row y and column x.
A tiny 3x3 grayscale image might look like this if we wrote down its pixel values:
[ [ 10,  30,  50],
  [120, 150, 180],
  [200, 220, 255] ]
Here, the pixel at (0,0) (top-left) is very dark (10), while the pixel at (2,2) (bottom-right) is pure white (255).
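To make this concrete, here is a minimal sketch using NumPy (one common choice of array library; the course does not prescribe a specific one) that stores the 3x3 image above and reads back individual pixels:

```python
import numpy as np

# The 3x3 grayscale image from the example above.
# Each entry is an 8-bit intensity: 0 = black, 255 = white.
img = np.array([
    [ 10,  30,  50],
    [120, 150, 180],
    [200, 220, 255],
], dtype=np.uint8)

# Arrays are indexed [row, column], so img[0, 0] is the top-left pixel.
print(img[0, 0])   # 10  (very dark)
print(img[2, 2])   # 255 (pure white)
print(img.shape)   # (3, 3)
```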
Color images are a bit more complex. They typically use a combination of primary colors to represent a full spectrum of hues. The most common model is RGB, which stands for Red, Green, and Blue. In the RGB model, each pixel has three values, one for the intensity of red, one for green, and one for blue. Each of these values also often ranges from 0 to 255.
So, a color pixel P(x,y) can be represented as a set of three values: (R(x,y),G(x,y),B(x,y)).
A tiny 1x2 color image (1 pixel high, 2 pixels wide) might have values like:
[ [(255,0,0), (0,255,0)] ] // First pixel is Red, Second pixel is Green
You can think of a color image as three separate grayscale images (one for each color channel) stacked on top of each other.
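In code, this stacking shows up directly in the array's shape. Here is a brief sketch, again assuming NumPy, that builds the 1x2 image above and pulls out each channel as its own 2D grayscale-like array:

```python
import numpy as np

# The 1x2 color image from the example: shape is (height, width, channels).
img = np.array([
    [(255, 0, 0), (0, 255, 0)],  # first pixel red, second pixel green
], dtype=np.uint8)

print(img.shape)  # (1, 2, 3)

# Each channel is a 2D array, just like a grayscale image.
red, green, blue = img[:, :, 0], img[:, :, 1], img[:, :, 2]
print(red)    # [[255   0]]
print(green)  # [[  0 255]]
```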
The dimensions of an image are usually described as width x height x channels.
So, a small color photo 640 pixels wide and 480 pixels high would be represented by 640×480×3 = 921,600 numbers! This can be a lot of data for an AI model to process directly.
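You can verify that count directly. Note that NumPy orders the axes as (height, width, channels), so the 640x480 photo below is created with shape (480, 640, 3):

```python
import numpy as np

# A blank 640-wide, 480-high RGB image.
photo = np.zeros((480, 640, 3), dtype=np.uint8)
print(photo.size)  # 921600 values for the model to process
```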
While raw pixel values are the basic representation, they are not always the most effective input for AI models, especially for complex tasks. The sheer number of values is one reason, as the 921,600 figure above suggests. Another is that raw intensities change with small shifts in lighting, position, or viewpoint, even when the content of the scene stays the same.
To address this, we often extract features from images. Features are derived, more compact, and hopefully more informative representations of the image content.
A color histogram is a simple yet useful feature. It counts how many pixels in an image fall into certain color ranges or bins. For example, it can tell us the distribution of reds, greens, and blues in an image, or the overall brightness distribution.
A histogram doesn't care where the colors are in the image, just how much of each color (or intensity) is present. This makes it robust to changes like object rotation or translation.
Figure: a simplified color histogram showing the proportion of dominant red, green, and blue tones in an example image. This gives a general sense of the image's color palette.
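As an illustration, here is one way to compute such a per-channel histogram with NumPy. The random image below is a stand-in for a real photo you would load from disk:

```python
import numpy as np

# Stand-in image: random pixels in place of a real photo, shape (H, W, 3).
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)

# Eight bins per channel: counts of pixels whose intensity falls in each range.
bin_edges = np.linspace(0, 256, 9)  # [0, 32, 64, ..., 256]
for channel, name in enumerate(["red", "green", "blue"]):
    counts, _ = np.histogram(img[:, :, channel], bins=bin_edges)
    print(name, counts)
```

Notice how compact this representation is: the 921,600 raw values of a 640x480 color photo shrink to just 24 bin counts, at the cost of discarding all positional information.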
Beyond color histograms, other types of features can be extracted, such as edges, corners, and texture patterns.
The goal of these engineered features is often to reduce the amount of data while retaining or even emphasizing the important information for a given task.
Pixels are arranged in a grid, and this spatial arrangement, or structure, is fundamental. The relationships between neighboring pixels, and groups of pixels, define shapes, objects, and the overall scene.
While features like color histograms discard spatial information, many AI techniques, particularly modern deep learning models like Convolutional Neural Networks (CNNs), are specifically designed to learn from and exploit this local spatial structure directly from the pixel data or from learned features. Understanding that images have inherent structure is important before diving into how models use this structure.
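To give a flavor of what "exploiting local spatial structure" means, here is a minimal hand-written convolution in NumPy. The 3x3 filter below is a classic vertical-edge detector, chosen by hand for this sketch; a CNN learns many such filters automatically from data:

```python
import numpy as np

# A tiny image with a vertical edge: dark left half, bright right half.
img = np.array([
    [10, 10, 10, 200, 200, 200],
    [10, 10, 10, 200, 200, 200],
    [10, 10, 10, 200, 200, 200],
    [10, 10, 10, 200, 200, 200],
], dtype=float)

# A filter that responds to left-to-right intensity changes.
kernel = np.array([
    [-1.0, 0.0, 1.0],
    [-1.0, 0.0, 1.0],
    [-1.0, 0.0, 1.0],
])

# Slide the filter over every 3x3 neighborhood (valid positions only).
h, w = img.shape
out = np.zeros((h - 2, w - 2))
for y in range(h - 2):
    for x in range(w - 2):
        out[y, x] = np.sum(img[y:y+3, x:x+3] * kernel)

print(out)  # large responses along the edge, zeros in the flat regions
```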
In more advanced AI, especially with deep learning, models can learn to transform an entire image (or parts of it) into a dense list of numbers called an image embedding or a feature vector. This embedding is a relatively low-dimensional representation compared to the raw pixels, but it aims to capture the high-level semantic content of the image.
Imagine you have a 1000-pixel by 1000-pixel color image. That's 3 million raw pixel values! An image embedding might represent this same image using just a few hundred numbers. The key idea is that images with similar content (e.g., two different photos of cats) will have embeddings that are "close" to each other in this numerical space, while images with different content (e.g., a cat and a car) will have embeddings that are "far apart."
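A common way to measure how "close" two embeddings are is cosine similarity. The three-number vectors below are invented purely for illustration; real embeddings come from a trained model and are much longer:

```python
import numpy as np

# Hypothetical embeddings (made-up numbers, far shorter than real ones).
cat_photo_a = np.array([0.9, 0.1, 0.3])
cat_photo_b = np.array([0.8, 0.2, 0.4])
car_photo   = np.array([0.1, 0.9, 0.7])

def cosine_similarity(a, b):
    # 1.0 = pointing the same way (similar content); near 0 = unrelated.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(cat_photo_a, cat_photo_b))  # ~0.98: very similar
print(cosine_similarity(cat_photo_a, car_photo))    # ~0.36: much less similar
```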
We won't go into how these embeddings are learned in this introductory course, but it's useful to know they exist. They are a powerful way to represent images for tasks like image search, classification, and, importantly for multimodal AI, for comparing and combining image information with other data types like text.
In summary, representing images for AI starts with the raw pixel data. From there, we can extract various features to get more abstract information, always keeping in mind the inherent structure of the image. And for more sophisticated understanding, AI models can learn compact, meaning-rich embeddings. These different representations provide the foundation for AI to "understand" and work with visual information.