As we've seen, feedforward networks, or Multi-Layer Perceptrons (MLPs), are powerful tools. However, when dealing with data like images, their standard structure reveals significant limitations. Consider what happens when you feed an image into an MLP: typically, you flatten the image matrix (e.g., a 28x28 pixel image becomes a 784-element vector).
This flattening process immediately discards crucial spatial information. Pixels that were adjacent in the 2D grid, potentially forming a line or texture, are treated as independent inputs after flattening. The network loses the inherent structure of the data; it doesn't inherently understand that pixels (0,0) and (0,1) are closer and likely more related than pixels (0,0) and (27,27). An MLP would have to learn these spatial relationships from scratch, which is highly inefficient.
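A quick sketch makes this concrete. The snippet below, a minimal illustration assuming NumPy and row-major flattening (no specific framework is implied by the text), shows how flattening places a pixel's vertical neighbor 28 positions away in the vector:

```python
import numpy as np

# A toy 28x28 grayscale image, matching the example above.
image = np.arange(28 * 28).reshape(28, 28)

# Flattening turns the 2D grid into a 784-element vector.
flat = image.reshape(-1)
print(flat.shape)  # (784,)

# In the 2D grid, (0, 0) and (1, 0) are vertical neighbors, yet their
# indices in the flattened vector are 28 apart; only the horizontal
# neighbor (0, 1) stays adjacent.
print(np.ravel_multi_index((0, 0), (28, 28)))  # 0
print(np.ravel_multi_index((0, 1), (28, 28)))  # 1
print(np.ravel_multi_index((1, 0), (28, 28)))  # 28
```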
Furthermore, high-resolution images lead to enormous input vectors. A modest 224x224 color image has 224×224×3=150,528 input values. Connecting this input layer to even a moderately sized first hidden layer in an MLP results in an explosion of parameters (weights and biases). Training such a network becomes computationally expensive and requires vast amounts of data to avoid overfitting, as the model has excessive capacity relative to the underlying structure it's trying to learn.
An MLP typically connects every input feature (flattened pixel) to every neuron in the first hidden layer, ignoring spatial relationships and leading to a large number of parameters.
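To put numbers on this, here is a back-of-the-envelope comparison in plain Python. The hidden-layer width of 256 and the 64-filter convolutional layer are illustrative assumptions, not values from the text:

```python
# Fully connected first layer: every input connects to every hidden neuron.
inputs = 224 * 224 * 3                  # 150,528 input values
hidden = 256                            # illustrative hidden-layer width
mlp_params = inputs * hidden + hidden   # weights + biases
print(f"MLP first layer:  {mlp_params:,} parameters")   # 38,535,424

# Convolutional first layer: 64 filters, each 3x3 across 3 input channels.
filters, k, channels = 64, 3, 3                         # illustrative sizes
conv_params = filters * (k * k * channels) + filters    # weights + biases
print(f"Conv first layer: {conv_params:,} parameters")  # 1,792
```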
Convolutional Neural Networks (CNNs) were developed specifically to address these shortcomings. They are designed to process data that comes in the form of multiple arrays, such as a color image, which is a 3D array with dimensions (height, width, color channels). The core motivation behind CNNs stems from three key ideas that make them highly effective for tasks involving grid-like data:
Local Receptive Fields: Instead of connecting every input neuron to every hidden neuron, CNNs employ neurons that respond only to a restricted region of the input layer. This region is called the local receptive field. Imagine a small window (e.g., 3x3 or 5x5 pixels) sliding over the input image. Each neuron in the next layer processes information from only one such window. This directly leverages the spatial locality present in images; nearby pixels are likely related and combine to form elementary visual features like edges or corners.
Parameter Sharing: This is a powerful concept. In a CNN, the same set of weights (often called a filter or kernel) is applied across different locations in the input image. If a filter is designed to detect a horizontal edge, it slides across the entire image, detecting that edge wherever it appears. This dramatically reduces the number of parameters compared to an MLP. Instead of learning separate weights for detecting a feature at every possible location, the CNN learns a single set of weights for that feature detector. This makes the network much more efficient and less prone to overfitting.
Translation Invariance (Equivariance): A direct consequence of parameter sharing is that convolution is translation equivariant: if a pattern shifts within the input, its response shifts correspondingly in the feature map. If the network learns to detect a pattern (e.g., a cat's eye) in one part of the image using a specific filter, the same filter detects that pattern wherever else it appears. Approximate translation invariance, where the output is unchanged by small shifts, is then typically obtained through later operations such as pooling. This is highly desirable for object recognition and other image analysis tasks, as an object's identity doesn't change based on its position in the frame. The sketch after the figure below illustrates all three ideas and checks this equivariance numerically.
A CNN neuron in a feature map connects only to a local patch (receptive field) of the input. The weights defining this connection (the filter) are shared across different spatial locations.
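The following minimal NumPy sketch implements the sliding-window computation directly (a naive "valid"-mode convolution, written as cross-correlation as most deep learning libraries do). The image size and edge-detector kernel values are illustrative assumptions:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' 2D convolution (cross-correlation). The same kernel
    weights are reused at every spatial location: parameter sharing."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Local receptive field: a kH x kW patch of the input.
            patch = image[i:i + kH, j:j + kW]
            out[i, j] = np.sum(patch * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((10, 10))

# A simple horizontal-edge detector (illustrative values).
kernel = np.array([[ 1.,  1.,  1.],
                   [ 0.,  0.,  0.],
                   [-1., -1., -1.]])

# Equivariance check: shifting the input down by one row shifts the
# feature map down by one row (ignoring border effects).
shifted = np.roll(image, shift=1, axis=0)
out = conv2d(image, kernel)
out_shifted = conv2d(shifted, kernel)
print(np.allclose(out[:-1], out_shifted[1:]))  # True
```

Note that only the nine kernel weights are learned, no matter how large the image is; the double loop reuses them at every position, which is exactly the parameter sharing described above.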
In essence, CNNs are motivated by the need to build models that respect the spatial hierarchy of image data. They learn features locally, share parameters efficiently, and build up complex representations from simpler ones, making them the standard approach for many computer vision tasks. The following sections will detail the specific operations, like convolution and pooling, that enable these properties.