As discussed earlier, standard feedforward networks treat input features independently, losing spatial relationships inherent in data like images. Convolutional Neural Networks address this by using a specialized operation called convolution as their core building block. This operation allows the network to learn and detect local patterns within the input, preserving spatial hierarchies.
At the heart of the convolution operation is the filter, also known as a kernel. Think of a filter as a small, learnable matrix of weights. Its purpose is to scan the input data and detect specific features or patterns. For example, in an image, one filter might learn to detect vertical edges, another might detect horizontal edges, and yet another might respond strongly to a particular texture or color gradient.
Filters are typically small spatially (e.g., 3x3, 5x5 pixels) but extend through the full depth of the input volume. If the input is a color image (with Red, Green, Blue channels), a filter will also have a depth of 3.
The convolution operation involves sliding the filter across the input data (like an image or the feature map from a previous layer) systematically. At each position, the filter overlays a small patch of the input. The core computation involves two steps: first, an element-wise multiplication between the filter's weights and the corresponding values in the input patch; second, a summation of all these products, to which a learnable bias term is added.
This single computed value (sum + bias) represents the response of the filter at that specific location in the input. A strong positive value indicates a strong presence of the feature the filter is designed to detect.
The filter then slides to the next position, and the process repeats until the entire input has been covered.
Figure: A 3x3 filter overlaying a region of the input. The corresponding elements are multiplied, and the results are summed (plus bias) to produce one value in the output feature map. The filter then moves to the next position.
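To make this arithmetic concrete, here is a small sketch using NumPy with a made-up 3x3 patch, filter, and bias (the values are purely illustrative) that computes the filter's response at a single position:

import numpy as np
# A hypothetical 3x3 patch of a single-channel input (illustrative values)
patch = np.array([[0.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0]])
# A hypothetical 3x3 filter that responds strongly to vertical lines
kernel = np.array([[-1.0, 2.0, -1.0],
                   [-1.0, 2.0, -1.0],
                   [-1.0, 2.0, -1.0]])
bias = 0.1  # learnable bias term (illustrative value)
# Step 1: element-wise multiplication; Step 2: sum of products plus bias
response = (patch * kernel).sum() + bias
print(response)  # 6.1, a strong positive response to the vertical line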
The output generated by applying a single filter across the entire input is a 2D matrix called a feature map or activation map. Each element in this map corresponds to the filter's response at a specific spatial location in the input. High activation values signify that the feature detected by the filter is strongly present at that location.
Typically, a convolutional layer uses multiple filters (e.g., 32, 64, or more), each initialized differently and thus learning to detect different features. Applying all these filters to the same input volume results in a stack of 2D feature maps, forming the output volume of the convolutional layer. The depth of this output volume equals the number of filters used.
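As a quick way to check these shapes, the short PyTorch sketch below (the choice of 16 filters is arbitrary) inspects a convolutional layer's weight tensor: its first dimension is the number of filters and its second is the input depth, so each filter spans the full depth of the input.

import torch.nn as nn
# 16 filters, each 3x3 spatially and spanning all 3 input channels
layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
print(layer.weight.shape)  # torch.Size([16, 3, 3, 3]): (num_filters, input_depth, height, width)
print(layer.bias.shape)    # torch.Size([16]): one learnable bias per filter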
The stride defines the step size the filter takes as it moves across the input. A stride of 1 moves the filter one pixel at a time, while a stride of 2 skips every other position, roughly halving the output's height and width.
Choosing the stride is a design decision that impacts the spatial dimensions of the output and the computational cost.
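A commonly used formula for the resulting spatial size is output_size = floor((input_size + 2 * padding - kernel_size) / stride) + 1. The sketch below defines a small helper function (introduced here only for illustration) and applies the formula for strides of 1 and 2:

def conv_output_size(input_size, kernel_size, stride, padding):
    # floor((input + 2 * padding - kernel) / stride) + 1
    return (input_size + 2 * padding - kernel_size) // stride + 1
# Example: 64-pixel-wide input, 3x3 kernel, no padding
print(conv_output_size(64, kernel_size=3, stride=1, padding=0))  # 62
print(conv_output_size(64, kernel_size=3, stride=2, padding=0))  # 31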
When a filter is applied, especially near the borders of the input, the output feature map naturally tends to shrink in size compared to the input. Furthermore, pixels at the very edge of the input are covered by the filter fewer times than pixels in the center, potentially leading to loss of information.
Padding addresses these issues by adding extra pixels (usually with a value of zero) around the border of the input volume before applying the convolution.
There are two common padding strategies: valid padding, where no padding is added and the output shrinks relative to the input, and same padding, where enough zeros are added around the border so that (with a stride of 1) the output has the same spatial dimensions as the input.
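To see the difference in practice, here is a brief PyTorch sketch (input and layer sizes chosen only for illustration) comparing a convolution with no padding against one with a padding of 1, which preserves the spatial size for a 3x3 kernel with stride 1:

import torch
import torch.nn as nn
x = torch.randn(1, 3, 64, 64)
# No padding ("valid"): the 64x64 input shrinks to 62x62
no_pad = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=1, padding=0)
print(no_pad(x).shape)    # torch.Size([1, 8, 62, 62])
# Padding of 1 ("same" for a 3x3 kernel, stride 1): spatial size is preserved
same_pad = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=1, padding=1)
print(same_pad(x).shape)  # torch.Size([1, 8, 64, 64])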
In essence, the convolution operation uses learnable filters to scan input data, performing element-wise multiplications and summations to create feature maps. These maps highlight the presence of specific patterns (learned by the filters) at different spatial locations. Parameters like the number of filters, filter size, stride, and padding allow control over how features are extracted and the spatial dimensions of the output. This process forms the foundation for how CNNs effectively analyze grid-like data like images.
Here's a simple example using PyTorch to define a 2D convolutional layer:
import torch
import torch.nn as nn
# Example: Input with 3 channels (e.g., RGB image), batch size 1
# Input dimensions: (batch_size, channels, height, width)
input_tensor = torch.randn(1, 3, 64, 64)
# Define a convolutional layer
# in_channels=3: Matches the input depth (RGB)
# out_channels=16: We want to use 16 different filters
# kernel_size=3: Each filter will be 3x3
# stride=1: Filter moves one pixel at a time
# padding=1: Use 'same' padding (for 3x3 kernel, stride 1)
conv_layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
# Apply the convolution
output_feature_map = conv_layer(input_tensor)
# Print the output shape
print(output_feature_map.shape)
# Output: torch.Size([1, 16, 64, 64])
# Batch size 1, 16 feature maps (one for each filter), height 64, width 64
# Note: Height and width remain 64 because padding=1 acts as 'same' padding
# for a 3x3 kernel with stride 1.
This basic operation, repeated across multiple layers, allows CNNs to learn a hierarchy of features, starting from simple edges and textures in early layers to more complex object parts in deeper layers.
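As a rough sketch of that idea (the layer sizes here are arbitrary), two convolutional layers can be stacked so that the second operates on the feature maps produced by the first rather than on raw pixels:

import torch
import torch.nn as nn
two_layer_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # sees raw RGB pixels, learns simple patterns
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # sees the 16 feature maps, combines them
    nn.ReLU(),
)
x = torch.randn(1, 3, 64, 64)
print(two_layer_cnn(x).shape)  # torch.Size([1, 32, 64, 64])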