Having conceptually explored Convolutional Neural Networks in the previous section, let's translate those ideas into a working PyTorch model. CNNs are constructed by stacking convolutional layers, activation functions, and pooling layers, followed typically by one or more fully connected layers for classification or regression. We will use PyTorch's torch.nn module, which provides pre-built implementations of these essential components.
Our goal is to build a simple CNN that can process image-like data. We'll start by defining the network architecture as a Python class, inheriting from torch.nn.Module.
Convolutional Layer (nn.Conv2d): This layer applies learnable filters to the input. The key parameters are:
in_channels: The number of channels in the input tensor (e.g., 1 for grayscale images, 3 for RGB images).
out_channels: The number of filters (and thus the number of channels in the output tensor). Each filter learns to detect different features.
kernel_size: The dimensions (height x width) of the filters. A single integer k implies a k x k filter.
stride: How many pixels the filter shifts at a time (default is 1).
padding: Adds padding around the input, often used to control the output spatial dimensions (default is 0).
import torch
import torch.nn as nn
# Example: A Conv2d layer expecting 3 input channels (e.g., RGB),
# producing 16 output channels using 5x5 filters.
conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=1, padding=2)
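To see how kernel_size, stride, and padding interact, the sketch below compares the standard output-size formula, floor((size + 2 * padding - kernel_size) / stride) + 1, with what the layer actually produces. The helper name conv_output_size, the batch size of 8, and the 32x32 input are assumptions chosen for illustration, and the formula as written assumes the default dilation of 1.

```python
import torch
import torch.nn as nn


def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension (hypothetical helper, dilation=1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=1, padding=2)

# Assumed input: a batch of 8 RGB images, each 32x32
x = torch.randn(8, 3, 32, 32)
y = conv(x)

expected = conv_output_size(32, kernel_size=5, stride=1, padding=2)
print(tuple(y.shape))  # (8, 16, 32, 32)
print(expected)        # 32
```

With a 5x5 kernel and stride 1, padding=2 keeps the height and width unchanged, while the channel count changes from 3 to 16.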
Pooling Layer (nn.MaxPool2d): This layer reduces the spatial dimensions (height and width) of the feature maps, making the representation more compact and slightly more robust to variations in feature location.
kernel_size: The size of the window over which to take the maximum value.
stride: How much the window shifts. Often set equal to kernel_size for non-overlapping pooling.
# Example: A MaxPool2d layer with a 2x2 window and a stride of 2.
# This will typically halve the height and width of the input.
pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
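A tiny worked example may make the pooling operation concrete. The sketch below applies a 2x2 max pool with stride 2 to a single 4x4 feature map whose values are 0 through 15; the input tensor is an assumption made up for illustration.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)

# One image, one channel, 4x4 values 0..15 laid out row by row
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
y = pool(x)

print(tuple(y.shape))    # (1, 1, 2, 2)
print(y[0, 0].tolist())  # [[5.0, 7.0], [13.0, 15.0]]
```

Each output value is the maximum of one non-overlapping 2x2 window, so the height and width are both halved.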
Activation Function (e.g., nn.ReLU): Introduces non-linearity, allowing the network to learn complex patterns. ReLU (Rectified Linear Unit) is a common choice. It's applied element-wise: f(x) = max(0, x).
# ReLU activation function
relu1 = nn.ReLU()
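As a quick check of the element-wise behavior, the sketch below applies ReLU to a small hand-picked tensor, once through the nn.ReLU module and once through the functional form F.relu. The input values are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

relu = nn.ReLU()
out_module = relu(x)        # module form, convenient inside nn.Sequential
out_functional = F.relu(x)  # functional form, convenient inside forward()

print(out_module.tolist())                      # [0.0, 0.0, 0.0, 1.5]
print(torch.equal(out_module, out_functional))  # True
```

Both forms compute the same result. ReLU has no learnable parameters, so choosing between them is mostly a matter of style.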
Linear Layer (nn.Linear): A standard fully connected layer. Used typically near the end of a CNN after spatial features have been extracted and flattened.
in_features: Number of input features (requires flattening the output of convolutional/pooling layers).
out_features: Number of output features (e.g., number of classes in a classification task).
# Example: A linear layer taking a flattened vector of 512 features
# and outputting 10 values (e.g., for 10 classes).
fc1 = nn.Linear(in_features=512, out_features=10)
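The flattening step mentioned above can be sketched as follows. The feature-map shape (32 channels of 4x4, giving 32 * 4 * 4 = 512 features) and the batch size of 4 are assumptions picked so that the numbers match the 512-feature example.

```python
import torch
import torch.nn as nn

# Assumed output of a conv/pool stack: batch of 4, 32 channels, 4x4 maps
features = torch.randn(4, 32, 4, 4)

# Collapse everything except the batch dimension: [4, 32, 4, 4] -> [4, 512]
flat = torch.flatten(features, start_dim=1)

fc = nn.Linear(in_features=512, out_features=10)
scores = fc(flat)

print(tuple(flat.shape))    # (4, 512)
print(tuple(scores.shape))  # (4, 10)
```

If in_features does not equal channels * height * width of the incoming feature maps, the linear layer raises a shape mismatch error at the first forward pass.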
We define our CNN by subclassing nn.Module. The layers are typically defined in the __init__ method, and the forward pass (how data flows through the layers) is defined in the forward method.
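Let's build a CNN with the following structure:
Input: a single-channel 28x28 image.
Conv1: 16 filters of size 5x5, stride 1, padding 2, followed by ReLU.
Pool1: 2x2 max pooling with stride 2.
Conv2: 32 filters of size 5x5, stride 1, padding 2, followed by ReLU.
Pool2: 2x2 max pooling with stride 2.
FC1: a fully connected layer from the flattened features to 128 units, followed by ReLU.
FC2: a fully connected output layer with 10 units, one per class.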
import torch
import torch.nn as nn
import torch.nn.functional as F  # Functional forms of activations and other utilities


class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Layer definitions
        # Convolutional Layer 1
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, stride=1, padding=2)
        # Max Pooling Layer 1
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        # Convolutional Layer 2
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=5, stride=1, padding=2)
        # Max Pooling Layer 2
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        # Fully connected layers
        # The input features for fc1 depend on the output shape after pooling
        # Input: 28x28 -> Conv1 (padding=2) -> 28x28 -> Pool1 (stride=2) -> 14x14
        #   -> Conv2 (padding=2) -> 14x14 -> Pool2 (stride=2) -> 7x7
        # So, the flattened size is 32 channels * 7 height * 7 width = 1568
        self.fc1 = nn.Linear(in_features=32 * 7 * 7, out_features=128)
        self.fc2 = nn.Linear(in_features=128, out_features=10)  # Output for 10 classes

    def forward(self, x):
        # Define the data flow through the layers
        # Input x shape: [Batch Size, 1, 28, 28]
        # Apply Conv1, ReLU, Pool1
        x = self.pool1(F.relu(self.conv1(x)))
        # Shape after pool1: [Batch Size, 16, 14, 14]
        # Apply Conv2, ReLU, Pool2
        x = self.pool2(F.relu(self.conv2(x)))
        # Shape after pool2: [Batch Size, 32, 7, 7]
        # Flatten the tensor for the fully connected layers
        # -1 lets PyTorch infer the batch dimension from the remaining size
        x = x.view(-1, 32 * 7 * 7)
        # Shape after view: [Batch Size, 1568]
        # Apply FC1 and ReLU
        x = F.relu(self.fc1(x))
        # Shape after fc1: [Batch Size, 128]
        # Apply FC2 (output layer; no activation here, since losses such as
        # nn.CrossEntropyLoss expect raw scores)
        x = self.fc2(x)
        # Shape after fc2: [Batch Size, 10]
        return x
Let's visualize the architecture flow:
Flow of data and tensor shapes through the SimpleCNN model. Note that channel count increases while spatial dimensions (height/width) decrease.
To use this model, first instantiate the class. Then, you can pass input data (as a PyTorch Tensor) through it. The input tensor must have the expected shape, including the batch dimension. For our SimpleCNN, this is [batch_size, in_channels, height, width], specifically [N, 1, 28, 28] where N is the number of samples in the batch.
# Instantiate the model
model = SimpleCNN()
print(model)
# Create a dummy input tensor (batch of 4 images, 1 channel, 28x28)
# The input itself does not need requires_grad=True for training;
# the model's parameters track gradients by default
dummy_input = torch.randn(4, 1, 28, 28)
# Pass the input through the model (forward pass)
output = model(dummy_input)
# Check the output shape
print(f"\nInput shape: {dummy_input.shape}")
print(f"Output shape: {output.shape}") # Expected: [4, 10]
Running this code will print the model's layer structure and confirm the output tensor shape matches our expectation ([4, 10]), representing the scores for 10 classes for each of the 4 images in the batch.
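Because the final layer applies no activation, these scores are raw logits rather than probabilities. The sketch below shows one common way to interpret them: softmax for probabilities, argmax for predicted classes, and nn.CrossEntropyLoss applied directly to the logits. The random logits stand in for a real model output, and the target labels are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for model(dummy_input): 4 samples, 10 class scores each
logits = torch.randn(4, 10)
# Hypothetical ground-truth class indices for the 4 samples
targets = torch.tensor([3, 7, 0, 1])

probs = F.softmax(logits, dim=1)  # each row sums to 1
preds = logits.argmax(dim=1)      # predicted class index per sample

# CrossEntropyLoss expects raw logits; it applies log-softmax internally
loss = nn.CrossEntropyLoss()(logits, targets)

print(tuple(probs.shape))  # (4, 10)
print(tuple(preds.shape))  # (4,)
print(loss.dim())          # 0 (a scalar tensor)
```

Applying softmax before nn.CrossEntropyLoss would normalize the scores twice, which is why the model returns the output of the last linear layer unchanged.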
This example demonstrates how to combine nn.Conv2d, nn.MaxPool2d, ReLU activations (applied here through F.relu, the functional counterpart of nn.ReLU), and nn.Linear layers within an nn.Module to create a basic CNN. An important detail when designing CNNs is correctly calculating how the tensor shapes change after each layer, especially when connecting the convolutional/pooling part to the fully connected part. We'll look more closely at tracking these shapes in the next section.
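One way to avoid computing the flattened size by hand is to push a dummy tensor through the convolutional part and read off the shapes. The sketch below rebuilds the same convolutional stack with nn.Sequential, which is an assumption made to keep the example self-contained rather than how SimpleCNN itself is written.

```python
import torch
import torch.nn as nn

# Same conv/pool configuration as SimpleCNN, rebuilt here for illustration
features = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

shapes = []
with torch.no_grad():
    x = torch.zeros(1, 1, 28, 28)  # a single dummy 28x28 grayscale image
    for layer in features:
        x = layer(x)
        shapes.append((layer.__class__.__name__, tuple(x.shape)))

# With a batch of 1, the element count equals channels * height * width
flattened_size = x.numel()

for name, shape in shapes:
    print(f"{name:10s} -> {shape}")
print(flattened_size)  # 1568
```

The result, 1568, matches the 32 * 7 * 7 used for in_features of fc1. If the input resolution or any layer setting changes, rerunning this check gives the new value.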
© 2025 ApX Machine Learning