Now that we've established the motivation for Convolutional Neural Networks (CNNs) and explored the fundamental operations of convolution and pooling, let's assemble these components into a typical architecture. While CNN architectures can vary significantly depending on the specific task and dataset, a common pattern has emerged, especially for image classification problems.
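A standard CNN generally consists of two main parts: a feature extraction stage built from convolutional and pooling layers, and a classification (or regression) stage built from fully connected layers.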
Let's break down the flow and components:
The input image (or other grid-like data) first passes through a series of convolutional and pooling layers.
This sequence of Convolution -> Activation (ReLU) -> Pooling often forms a "block", and multiple such blocks can be stacked. As we go deeper into the network (stack more blocks), the convolutional layers tend to learn increasingly complex and abstract features built upon the features detected by earlier layers. The filter sizes might stay small (e.g., 3x3), but the number of filters often increases in deeper layers, allowing the network to capture a wider variety of features.
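As a minimal sketch of stacking such blocks, assuming PyTorch, the snippet below chains three blocks and records how the tensor shape changes. The helper name `conv_block` and the filter counts (16, 32, 64) are illustrative assumptions, not part of any library or of the architecture discussed here.

```python
import torch
import torch.nn as nn


def conv_block(in_channels: int, out_channels: int) -> nn.Sequential:
    """Hypothetical helper: Convolution -> ReLU -> 2x2 max pooling."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )


# Filter counts double from block to block (an assumed, illustrative choice).
blocks = nn.ModuleList([conv_block(3, 16), conv_block(16, 32), conv_block(32, 64)])

x = torch.randn(1, 3, 32, 32)  # (batch, channels, height, width)
shapes = []
for block in blocks:
    x = block(x)
    shapes.append(tuple(x.shape))
    print(tuple(x.shape))
# Prints (1, 16, 16, 16), then (1, 32, 8, 8), then (1, 64, 4, 4)
```

Each block halves the spatial size while the channel count grows, which is the trade-off described above: less spatial detail, more kinds of features.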
After the final pooling layer in the feature extraction stack, the resulting feature maps are typically 3D tensors (height x width x number of channels/filters). However, standard Fully Connected layers expect a 1D vector as input. Therefore, a "Flatten" operation is performed. This simply reshapes the 3D feature maps into a single, long 1D vector, effectively lining up all the learned feature activations.
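The flatten step can be sketched in a few lines, assuming PyTorch. Note that PyTorch stores feature maps channels-first, as (batch, channels, height, width), and that the batch dimension is kept while everything after it is merged. The tensor sizes are made up for illustration.

```python
import torch

# A dummy batch of 2 feature-map stacks: 64 channels, each 8x8.
feature_maps = torch.randn(2, 64, 8, 8)

# Keep the batch dimension (dim 0) and merge all remaining dimensions.
flat = torch.flatten(feature_maps, start_dim=1)
print(flat.shape)  # torch.Size([2, 4096]), since 64 * 8 * 8 = 4096

# An equivalent reshape that yields the same values in the same order.
same = feature_maps.reshape(feature_maps.size(0), -1)
print(torch.equal(flat, same))  # True
```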
The flattened vector is then fed into one or more Fully Connected layers.
The following diagram illustrates a common CNN structure for image classification:
Figure: A typical flow through a CNN. The input image passes through convolutional and pooling layers for feature extraction. The resulting feature maps are flattened into a vector, which is then processed by fully connected layers for final classification or regression.
Here's how you might define a simple CNN architecture similar to the one diagrammed above using PyTorch's nn.Sequential:
```python
import torch
import torch.nn as nn

# Assuming input images are 32x32 pixels with 3 color channels (RGB)
# and we want to classify them into 10 categories
model = nn.Sequential(
    # Feature Extraction Block 1
    nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1),  # Output: 32 channels, 32x32
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),  # Output: 32 channels, 16x16

    # Feature Extraction Block 2
    nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1),  # Output: 64 channels, 16x16
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),  # Output: 64 channels, 8x8

    # Flattening
    nn.Flatten(),  # Output: 64 * 8 * 8 = 4096 features

    # Classification Layers
    nn.Linear(in_features=64 * 8 * 8, out_features=128),
    nn.ReLU(),
    nn.Linear(in_features=128, out_features=10),  # Output layer for 10 classes
    # Note: Softmax is often applied implicitly by the loss function (e.g., CrossEntropyLoss)
)

# Example usage: create a dummy input tensor
dummy_input = torch.randn(1, 3, 32, 32)  # (batch_size, channels, height, width)
output = model(dummy_input)
print(output.shape)  # Expected output: torch.Size([1, 10])
```
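The note about softmax being applied by the loss function can be checked with a short sketch, assuming PyTorch. The logits and labels below are random stand-ins for real model outputs and targets. The snippet compares `nn.CrossEntropyLoss` on raw logits with an explicit log-softmax followed by negative log-likelihood, then converts logits to probabilities for prediction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)           # stand-in for model output: 4 images, 10 classes
targets = torch.tensor([3, 0, 9, 1])  # made-up class labels

# CrossEntropyLoss expects raw logits, not softmax probabilities.
loss = nn.CrossEntropyLoss()(logits, targets)

# The same value computed explicitly: log-softmax, then negative log-likelihood.
manual = F.nll_loss(torch.log_softmax(logits, dim=1), targets)
print(torch.allclose(loss, manual))  # True

# At inference time, apply softmax yourself if you need probabilities.
probs = torch.softmax(logits, dim=1)
predicted = probs.argmax(dim=1)       # one predicted class index per image
```

Because the loss already includes the softmax, adding an `nn.Softmax` layer at the end of the model during training would apply it twice.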
This structure allows the network to learn spatial hierarchies of features. Early layers detect simple patterns like edges and corners, while deeper layers combine these to recognize more complex structures relevant to the final task. The pooling layers provide some robustness to small shifts in the input and reduce computational load by shrinking the feature maps, while the fully connected layers integrate the learned features for prediction.
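To make the effect of pooling on model size concrete, here is a rough sketch, assuming PyTorch. It rebuilds the 32x32 RGB example network from this section and counts learnable parameters per layer. The helper `count_params` is a hypothetical name introduced only for this illustration.

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)


def count_params(module: nn.Module) -> int:
    """Hypothetical helper: total number of learnable parameters."""
    return sum(p.numel() for p in module.parameters())


# Parameter counts keyed by position in the Sequential (layers without weights are skipped).
per_layer = {
    name: count_params(layer)
    for name, layer in model.named_children()
    if count_params(layer) > 0
}
print(per_layer)  # {'0': 896, '3': 18496, '7': 524416, '9': 1290}

# Without the two pooling layers the flattened vector would have 64 * 32 * 32 entries,
# so the first fully connected layer alone would need this many parameters:
first_fc_without_pooling = (64 * 32 * 32) * 128 + 128
print(first_fc_without_pooling)  # 8388736
```

In this small example the first fully connected layer holds most of the parameters, and the two pooling steps make it 16 times smaller than it would otherwise be.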
© 2025 ApX Machine Learning