Before we examine the sophisticated architectures that define modern computer vision, let's quickly revisit the fundamental components that form the basis of nearly all Convolutional Neural Networks (CNNs). While we assume you're comfortable with these concepts, a brief recap will ensure we share a common understanding of the terminology and mechanics.
The Convolutional Layer: Learning Spatial Hierarchies
The cornerstone of any CNN is the convolutional layer. Its primary function is to detect local patterns, such as edges, corners, and textures, within an input image (or the feature map from a previous layer). This is achieved by sliding small filters, also known as kernels, across the input volume.
Key aspects of a convolutional layer include:
- Filters (Kernels): These are small matrices of learnable parameters (weights). Each filter is specialized to detect a specific type of feature. For instance, one filter might learn to detect horizontal edges, while another detects a particular color pattern. The depth of the filter matches the depth (number of channels) of the input volume.
- Feature Maps: The output of applying a filter across the input is a 2D activation map, or feature map. This map highlights the regions in the input where the filter's specific pattern was detected. A convolutional layer typically applies multiple filters, producing a stack of feature maps (a volume) as its output.
- Stride: This parameter defines the step size with which the filter moves across the input. A stride of 1 means the filter moves one pixel at a time, while a stride of 2 moves it two pixels at a time, typically producing a smaller output feature map.
- Padding: Often, zero-padding is added around the border of the input volume. This allows the filter to process the edges of the input more effectively and can control the spatial dimensions of the output feature map. 'Same' padding aims to keep the output spatial dimensions the same as the input (given a stride of 1), while 'valid' padding uses no padding, potentially reducing the output size.
- Parameter Sharing: A significant advantage of convolutional layers is parameter sharing. The weights of a single filter are reused at every spatial location of the input. This drastically reduces the number of parameters compared to a fully connected layer and makes the network somewhat invariant to the location of features.
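To make these mechanics concrete, here is a minimal NumPy sketch of a single-channel convolution (the function name `conv2d_single` and the test image are ours, purely for illustration; real frameworks implement this far more efficiently and, like most of them, we compute cross-correlation rather than a flipped-kernel convolution):

```python
import numpy as np

def conv2d_single(x, kernel, stride=1, padding=0):
    """Naive single-channel 2D 'convolution' (cross-correlation,
    as in most deep-learning libraries). Illustrative only."""
    if padding:
        x = np.pad(x, padding)  # zero-padding around the border
    kh, kw = kernel.shape
    # Output size formula: (input + 2*padding - kernel) // stride + 1
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)  # same weights at every location
    return out

img = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
edge = np.array([[1., 0., -1.]] * 3)             # vertical-edge detector

print(conv2d_single(img, edge).shape)             # (3, 3): 'valid', no padding
print(conv2d_single(img, edge, padding=1).shape)  # (5, 5): 'same' at stride 1
```

Note how parameter sharing shows up here: the 3×3 filter carries only 9 weights no matter how large the input is, whereas a fully connected layer mapping the 5×5 input to the 3×3 output would need 25 × 9 = 225 weights.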
Activation Functions: Introducing Non-linearity
After the convolution operation, an activation function is typically applied element-wise to the resulting feature map. Its purpose is to introduce non-linearity into the network. Without non-linearity, stacking multiple convolutional layers would be equivalent to a single, larger convolutional layer, limiting the network's ability to model complex relationships in the data.
The most common activation function in modern CNNs is the Rectified Linear Unit (ReLU):
ReLU(x) = max(0, x)
ReLU is computationally efficient and helps mitigate the vanishing gradient problem encountered with earlier activation functions like sigmoid or tanh in very deep networks. While variants like Leaky ReLU, Parametric ReLU (PReLU), or GELU exist and are used in advanced architectures, standard ReLU remains a strong baseline.
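Both ReLU and its leaky variant are one-liners in NumPy; the sketch below (our own illustrative functions, with the conventional default slope of 0.01 for the leaky variant) shows the element-wise behavior on a small feature-map slice:

```python
import numpy as np

def relu(x):
    # Zeroes out negative activations, passes positives through unchanged.
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but lets a small gradient (slope alpha) through for x < 0.
    return np.where(x > 0, x, alpha * x)

fmap = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(fmap))        # negatives clipped to 0, 1.5 unchanged
print(leaky_relu(fmap))  # negatives scaled by 0.01 instead of clipped
```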
Pooling Layers: Downsampling and Invariance
Pooling layers are often inserted between successive convolutional layers. Their main goals are:
- Dimensionality Reduction: They progressively reduce the spatial dimensions (width and height) of the feature maps. This decreases the number of parameters and computational load in subsequent layers.
- Translation Invariance: Pooling makes the representations slightly more robust to small translations or distortions in the input image. By summarizing features in a local neighborhood, the exact location of the feature becomes less critical.
Common pooling strategies include:
- Max Pooling: Selects the maximum value from the feature map patch covered by the pooling window. It tends to retain the strongest activations.
- Average Pooling: Calculates the average value within the pooling window. It provides a smoother downsampling.
Like convolutional layers, pooling layers have a window size and a stride. For example, a 2×2 window with a stride of 2 is commonly used, effectively halving the width and height of the feature map while keeping the depth unchanged.
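A short NumPy sketch makes the two strategies easy to compare (the function `pool2d` and the sample feature map are ours, for illustration only):

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Naive 2D pooling over a single feature map. Illustrative only."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

fmap = np.array([[1., 3., 2., 0.],
                 [4., 6., 1., 2.],
                 [0., 2., 5., 7.],
                 [1., 3., 8., 6.]])

print(pool2d(fmap, mode="max"))  # [[6. 2.] [3. 8.]]: strongest activations kept
print(pool2d(fmap, mode="avg"))  # [[3.5 1.25] [1.5 6.5]]: smoother summary
```

With the 2×2 window and stride 2, the 4×4 map is halved to 2×2 in each spatial dimension, exactly as described above; depth (applied per channel) would be unchanged.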
*Figure: A typical sequence of operations within a CNN block, transforming an input volume into a set of abstract feature maps.*
These three components, stacked in various configurations, form the basic processing units of most CNNs. Understanding their individual roles and how they interact is essential before exploring the architectural innovations designed to build deeper and more capable networks, which we will cover next.
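Putting the three components together, a single Conv → ReLU → Pool block can be sketched end to end in NumPy (all function names here are ours; a real network would use a framework layer such as a 2D convolution module, train the kernel weights, and operate on multi-channel volumes rather than this single-channel toy):

```python
import numpy as np

def conv2d(x, k, stride=1, padding=0):
    # Naive single-channel cross-correlation with zero-padding.
    if padding:
        x = np.pad(x, padding)
    kh, kw = k.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    return np.array([[np.sum(x[i*stride:i*stride+kh, j*stride:j*stride+kw] * k)
                      for j in range(ow)] for i in range(oh)])

def relu(x):
    return np.maximum(0, x)

def max_pool(x, size=2, stride=2):
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    return np.array([[x[i*stride:i*stride+size, j*stride:j*stride+size].max()
                      for j in range(ow)] for i in range(oh)])

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))      # toy single-channel input
kernel = rng.standard_normal((3, 3))   # one (untrained) learnable filter

fmap = conv2d(img, kernel, padding=1)  # 'same' padding: stays 8x8
act = relu(fmap)                       # non-linearity, element-wise
out = max_pool(act)                    # 2x2/stride-2 pooling: 4x4
print(img.shape, "->", out.shape)      # (8, 8) -> (4, 4)
```

Stacking several such blocks, with more filters (and hence deeper feature-map volumes) at each stage, is exactly the pattern the architectures in the next part build upon.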