In previous chapters, we explored how standard neural networks, particularly those built with `Dense` layers, process input data. Each neuron in a `Dense` layer connects to every neuron in the previous layer. While effective for many tasks where input features are relatively independent or their order doesn't hold intrinsic meaning (like tabular data), this approach has significant limitations when dealing with data possessing inherent spatial structure, most notably images.
Imagine flattening a 2D image into a single long vector to feed it into a `Dense` layer. This process discards the crucial spatial relationships between pixels. Pixels that were neighbors, forming edges or textures, are now treated as distant, unrelated features in the input vector. The network loses the information that certain pixels are adjacent horizontally, vertically, or diagonally. Consequently, a standard network struggles to learn features that depend on these local arrangements.
Furthermore, consider the parameter explosion. A relatively small 256x256 grayscale image has 65,536 pixels. If the first hidden layer has just 512 neurons, the connections between the input and this layer would require 65,536 × 512 ≈ 33.5 million weights, plus biases. This number grows dramatically for larger images or color images (which have multiple channels). Training such a network becomes computationally expensive and prone to overfitting due to the sheer number of parameters.
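The arithmetic above is easy to verify directly. A minimal sketch in plain Python, using the dimensions from the text:

```python
# Parameter count for a fully connected layer on a flattened image.
height, width = 256, 256       # grayscale image dimensions from the text
inputs = height * width        # 65,536 input features after flattening
units = 512                    # neurons in the first hidden layer

weights = inputs * units       # one weight per input-neuron connection
biases = units                 # one bias per neuron
total = weights + biases

print(f"{total:,} parameters")  # 33,554,944 parameters
```

For a color image with three channels, `inputs` triples, and the weight count grows to over 100 million for the same 512-neuron layer.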
Another challenge is the lack of translation invariance. If we train a standard network to recognize an object (say, a cat) in the top-left corner of an image, it won't automatically recognize the same cat if it appears in the bottom-right corner. Because the flattened input vector looks entirely different, the network needs to learn the cat pattern again for the new location. This isn't efficient; we want our network to recognize patterns regardless of where they appear in the image.
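To see why the flattened vectors look entirely different, consider a small NumPy illustration (the image size and the 3x3 "object" here are made up for demonstration):

```python
import numpy as np

# Place the same 3x3 bright square at two positions in an 8x8 image
# and compare the flattened vectors a Dense layer would receive.
def place_patch(top, left, size=8):
    img = np.zeros((size, size))
    img[top:top+3, left:left+3] = 1.0  # the 3x3 "object"
    return img.flatten()

v1 = place_patch(0, 0)  # object in the top-left corner
v2 = place_patch(5, 5)  # same object in the bottom-right corner

# Count input positions that are active in both vectors.
overlap = np.sum((v1 > 0) & (v2 > 0))
print(overlap)  # 0: the two vectors activate entirely different inputs
```

Although both images contain the identical pattern, no input feature is shared between the two flattened vectors, so a fully connected network must learn the pattern separately for each position.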
Convolutional Neural Networks (CNNs) are designed specifically to address these challenges by incorporating two powerful ideas inspired by how biological vision systems process information:
Local Connectivity (Local Receptive Fields): Instead of connecting every input pixel to every neuron in the first hidden layer, CNNs use specialized layers (convolutional layers) where neurons only respond to a small, localized region of the input image. This region is called the neuron's receptive field. This local focus allows the network to first learn small, low-level patterns like edges, corners, and textures directly from the spatial arrangement of nearby pixels.
Parameter Sharing: A single set of weights, forming a "filter" or "kernel," is applied across all locations in the input image. This filter slides over the input, performing the same operation at each position to detect a specific pattern (e.g., a vertical edge). If a pattern is useful to detect in one part of the image, it's likely useful elsewhere. This drastically reduces the number of parameters compared to a fully connected layer (we only learn the weights for the filter, not for every input-neuron connection) and builds in translation invariance, the ability to detect a pattern regardless of its position in the image.
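The sliding-filter idea behind both principles can be sketched in a few lines of NumPy. The vertical-edge kernel below is a classic illustrative choice; in a real CNN, the filter weights are learned during training:

```python
import numpy as np

# A single 3x3 filter (just 9 shared weights) slid over the whole image:
# the same weights inspect every local receptive field.
def convolve2d(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Identical kernel weights applied at each position
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# Hand-crafted vertical-edge filter: responds where brightness
# changes from left to right.
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

# Toy image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

response = convolve2d(image, kernel)
# Only the columns straddling the edge respond (here with value -3);
# flat regions produce 0, wherever they are in the image.
```

Note the parameter economy: this filter has 9 weights no matter how large the input image is, whereas a fully connected layer would need one weight per pixel per neuron.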
These core principles allow CNNs to build a hierarchy of features. Early layers detect simple patterns (edges, colors), subsequent layers combine these to detect more complex patterns (textures, shapes), and deeper layers combine those to recognize objects or parts of objects. This hierarchical learning process mirrors how we understand visual information and is highly effective for tasks involving grid-like data structures.
The effectiveness of CNNs stems from their ability to exploit the spatial locality present in grid-like data. Images are the canonical example, being 2D grids of pixels, but the same principles apply to other grid-like structures, such as 1D sequences (audio signals, time series) and 3D volumes (video or volumetric medical scans).
By preserving and analyzing spatial relationships through local connectivity and parameter sharing, CNNs provide a much more efficient and effective architecture for learning from these types of structured data compared to standard fully connected networks. In the following sections, we will examine the specific Keras layers, like `Conv2D` and `MaxPooling2D`, that implement these fundamental CNN concepts.
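As a preview of what `MaxPooling2D` computes, here is a plain NumPy sketch of 2x2 max pooling (matching the layer's default window and stride; the feature-map values are made up):

```python
import numpy as np

# 2x2 max pooling: keep the largest value in each non-overlapping
# 2x2 window, halving both spatial dimensions.
def max_pool_2x2(feature_map):
    h, w = feature_map.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            out[i // 2, j // 2] = feature_map[i:i+2, j:j+2].max()
    return out

# A toy 4x4 feature map, as might come out of a convolutional layer.
fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 6],
                 [2, 2, 7, 8]], dtype=float)

pooled = max_pool_2x2(fmap)
print(pooled)  # [[4. 2.]
               #  [2. 8.]]
```

Pooling keeps the strongest response in each neighborhood, which shrinks the representation and makes detected features slightly more tolerant of small shifts in position.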
© 2025 ApX Machine Learning