Building a Simple CNN in PyTorch

Having explored Convolutional Neural Networks in the previous section, let's translate those ideas into a working PyTorch model. CNNs are constructed by stacking convolutional layers, activation functions, and pooling layers, typically followed by one or more fully connected layers for classification or regression. We will use PyTorch's torch.nn module, which provides pre-built implementations of these essential components.

Our goal is to build a simple CNN that can process image-like data. We'll start by defining the network architecture as a Python class, inheriting from torch.nn.Module.

Core CNN Layers in PyTorch

Convolutional Layer (nn.Conv2d): This layer applies learnable filters to the input. The main parameters are:
- in_channels: The number of channels in the input tensor (e.g., 1 for grayscale images, 3 for RGB images).
- out_channels: The number of filters (and thus the number of channels in the output tensor). Each filter learns to detect different features.
- kernel_size: The dimensions (height x width) of the filters. A single integer k implies a k x k filter.
- stride: How many pixels the filter shifts at a time (default is 1).
- padding: Adds padding around the input, often used to control the output spatial dimensions (default is 0).
```
import torch
import torch.nn as nn

# Example: A Conv2d layer expecting 3 input channels (e.g., RGB),
# producing 16 output channels using 5x5 filters.
conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=1, padding=2)
```
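To see how kernel_size, stride, and padding determine the output size, here is a minimal sketch. It assumes the default dilation of 1, for which each spatial dimension follows floor((input + 2 * padding - kernel_size) / stride) + 1. The helper `conv_out_size` and the 32x32 input are illustrative choices, not part of PyTorch or of the model built later.

```python
import torch
import torch.nn as nn

def conv_out_size(size, kernel_size, stride=1, padding=0):
    """Hypothetical helper: output length along one spatial dimension (dilation = 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=1, padding=2)

# Arbitrary example input: a batch of 8 RGB images of size 32x32
x = torch.randn(8, 3, 32, 32)
y = conv(x)

expected = conv_out_size(32, kernel_size=5, stride=1, padding=2)
print(y.shape)   # torch.Size([8, 16, 32, 32])
print(expected)  # 32
```

With a 5x5 kernel and stride 1, padding=2 keeps the height and width unchanged, which is why the same setting is convenient when stacking several convolutional layers.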
Pooling Layer (nn.MaxPool2d): This layer reduces the spatial dimensions (height and width) of the feature maps, making the representation more compact and slightly more robust to small variations in feature location.
- kernel_size: The size of the window over which to take the maximum value.
- stride: How much the window shifts. Often set equal to kernel_size for non-overlapping pooling.
```
# Example: A MaxPool2d layer with a 2x2 window and a stride of 2.
# This will typically halve the height and width of the input.
pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
```
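A small hand-checkable example may make the pooling operation more concrete. The 4x4 input values below are made up for illustration; the second call shows that an odd-sized input is rounded down under the default settings, which is why the comment above says "typically" halves.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)

# One sample, one channel, 4x4 feature map (values chosen by hand)
x = torch.tensor([[[[ 1.,  2.,  5.,  6.],
                    [ 3.,  4.,  7.,  8.],
                    [ 9., 10., 13., 14.],
                    [11., 12., 15., 16.]]]])

y = pool(x)
print(y.shape)  # torch.Size([1, 1, 2, 2])
print(y)        # maximum of each 2x2 window: 4, 8, 12, 16

# An odd spatial size is floored: 7x7 -> 3x3
odd = pool(torch.randn(1, 1, 7, 7))
print(odd.shape)  # torch.Size([1, 1, 3, 3])
```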
Activation Function (e.g., nn.ReLU): Introduces non-linearity, allowing the network to learn complex patterns. ReLU (Rectified Linear Unit) is a common choice. It's applied element-wise: $f(x) = \max(0, x)$.
```
# ReLU activation function
relu1 = nn.ReLU()
```
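As a quick check of the element-wise behavior, the sketch below applies ReLU to a few hand-picked values. It also compares the module form (nn.ReLU) with the functional form (torch.nn.functional.relu), which compute the same result.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

relu = nn.ReLU()
x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

out_module = relu(x)        # module form
out_functional = F.relu(x)  # functional form

print(out_module)  # negatives become 0; 1.5 passes through unchanged
print(torch.equal(out_module, out_functional))  # True
```

Because ReLU has no learnable parameters, the choice between the two forms is mostly a matter of style.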
Linear Layer (nn.Linear): A standard fully connected layer. It is typically used near the end of a CNN, after spatial features have been extracted and flattened.
- in_features: Number of input features (requires flattening the output of convolutional/pooling layers).
- out_features: Number of output features (e.g., number of classes in a classification task).
```
# Example: A linear layer taking a flattened vector of 512 features
# and outputting 10 values (e.g., for 10 classes).
fc1 = nn.Linear(in_features=512, out_features=10)
```
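The flattening step that in_features depends on can be sketched as follows. The feature-map shape [4, 32, 4, 4] is an assumption chosen only so that 32 * 4 * 4 equals the 512 features used above.

```python
import torch
import torch.nn as nn

# Hypothetical feature maps from a conv/pool stack: 32 channels of 4x4 per sample
features = torch.randn(4, 32, 4, 4)

# Flatten everything except the batch dimension: 32 * 4 * 4 = 512 values per sample
flat = torch.flatten(features, start_dim=1)

fc = nn.Linear(in_features=512, out_features=10)
out = fc(flat)

print(flat.shape)  # torch.Size([4, 512])
print(out.shape)   # torch.Size([4, 10])
```

If in_features does not match the flattened size, the forward pass raises a shape mismatch error, so it is worth verifying this number before training.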

Defining the CNN Architecture

We define our CNN by subclassing nn.Module. The layers are typically defined in the __init__ method, and the forward pass (how data flows through the layers) is defined in the forward method.

Let's build a CNN with the following structure:

- Input: [Batch Size, 1, 28, 28] (e.g., grayscale images like MNIST)
- Conv1: 1 input channel, 16 output channels, 5x5 kernel, stride 1, padding 2
- ReLU1
- MaxPool1: 2x2 kernel, stride 2
- Conv2: 16 input channels, 32 output channels, 5x5 kernel, stride 1, padding 2
- ReLU2
- MaxPool2: 2x2 kernel, stride 2
- Flatten
- Linear1: (Input features depend on output of MaxPool2), 128 output features
- ReLU3
- Linear2: 128 input features, 10 output features (e.g., for 10 classes)

```
import torch
import torch.nn as nn
import torch.nn.functional as F  # Functional versions of activations and other utilities

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Layer definitions
        # Convolutional Layer 1
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, stride=1, padding=2)
        # Max Pooling Layer 1
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)

        # Convolutional Layer 2
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=5, stride=1, padding=2)
        # Max Pooling Layer 2
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)

        # Fully connected layers
        # The input features for fc1 depend on the output shape after pooling
        # Input: 28x28 -> Conv1 (padding=2) -> 28x28 -> Pool1 (stride=2) -> 14x14
        # -> Conv2 (padding=2) -> 14x14 -> Pool2 (stride=2) -> 7x7
        # So, the flattened size is 32 channels * 7 height * 7 width = 1568
        self.fc1 = nn.Linear(in_features=32 * 7 * 7, out_features=128)
        self.fc2 = nn.Linear(in_features=128, out_features=10)  # Output for 10 classes

    def forward(self, x):
        # Define the data flow through the layers
        # Input x shape: [Batch Size, 1, 28, 28]

        # Apply Conv1, ReLU, Pool1
        x = self.pool1(F.relu(self.conv1(x)))
        # Shape after pool1: [Batch Size, 16, 14, 14]

        # Apply Conv2, ReLU, Pool2
        x = self.pool2(F.relu(self.conv2(x)))
        # Shape after pool2: [Batch Size, 32, 7, 7]

        # Flatten the tensor for the fully connected layers
        # -1 lets PyTorch infer the batch dimension from the remaining size
        x = x.view(-1, 32 * 7 * 7)
        # Shape after view: [Batch Size, 1568]

        # Apply FC1 and ReLU
        x = F.relu(self.fc1(x))
        # Shape after fc1: [Batch Size, 128]

        # Apply FC2 (output layer; no activation here, since losses such as
        # nn.CrossEntropyLoss expect raw scores)
        x = self.fc2(x)
        # Shape after fc2: [Batch Size, 10]
        return x
```

Figure: Flow of data and tensor shapes through the SimpleCNN model. Note that the channel count increases while the spatial dimensions (height and width) decrease.
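The same flow can be printed directly. The sketch below rebuilds the SimpleCNN layer sequence as an nn.Sequential (an equivalent arrangement chosen here only so that each stage can be stepped through in a loop; it is not how the class above is written) and records the tensor shape after every layer.

```python
import torch
import torch.nn as nn

# Same layers and hyperparameters as SimpleCNN, arranged sequentially for inspection
layers = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

x = torch.randn(4, 1, 28, 28)  # dummy batch of 4 grayscale 28x28 images
shapes = [tuple(x.shape)]      # shapes[0] is the input shape

with torch.no_grad():          # shape inspection only, no gradients needed
    for layer in layers:
        x = layer(x)
        shapes.append(tuple(x.shape))
        print(f"{type(layer).__name__:<10} -> {tuple(x.shape)}")
```

The printed shapes should match the comments in the forward method: 16 channels at 14x14 after the first pooling layer, 32 channels at 7x7 after the second, and 1568 features after flattening.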

Using the Model

To use this model, first instantiate the class. Then, you can pass input data (as a PyTorch Tensor) through it. The input tensor must have the expected shape, including the batch dimension. For our SimpleCNN, this is [batch_size, in_channels, height, width], specifically [N, 1, 28, 28] where N is the number of samples in the batch.

```
# Instantiate the model
model = SimpleCNN()
print(model)

# Create a dummy input tensor (batch of 4 images, 1 channel, 28x28)
# The input does not need requires_grad=True for training;
# gradients are tracked for the model's parameters automatically
dummy_input = torch.randn(4, 1, 28, 28)

# Pass the input through the model (forward pass)
output = model(dummy_input)

# Check the output shape
print(f"\nInput shape: {dummy_input.shape}")
print(f"Output shape: {output.shape}")  # Expected: [4, 10]
```

Running this code will print the model's layer structure and confirm the output tensor shape matches our expectation ([4, 10]), representing the scores for 10 classes for each of the 4 images in the batch.
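Since the output layer applies no activation, these scores are raw logits. The sketch below shows one common way to turn them into probabilities, predicted classes, and a loss value. The random `logits` tensor stands in for the model output, and the `targets` labels are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 10)           # stand-in for model(dummy_input)
targets = torch.tensor([3, 0, 7, 1])  # hypothetical class labels for the 4 samples

# Softmax over the class dimension gives per-class probabilities
probs = F.softmax(logits, dim=1)
preds = probs.argmax(dim=1)           # predicted class index per sample

# nn.CrossEntropyLoss takes raw logits and applies log-softmax internally
loss = nn.CrossEntropyLoss()(logits, targets)

print(probs.sum(dim=1))  # each row sums to 1, up to floating-point rounding
print(preds.shape)       # torch.Size([4])
print(loss.item())
```

Applying softmax inside forward and then passing the result to nn.CrossEntropyLoss would normalize twice, which is why the model returns unnormalized scores.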

This example demonstrates how to combine nn.Conv2d, nn.MaxPool2d, ReLU activations (nn.ReLU or its functional form F.relu), and nn.Linear layers within an nn.Module to create a basic CNN. An important detail when designing CNNs is correctly calculating how the tensor shapes change after each layer, especially when connecting the convolutional/pooling part to the fully connected part. We'll look more closely at tracking these shapes in the next section.
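One way to avoid computing the flattened size by hand is to pass a dummy tensor through the convolutional part and read the size off the result. This is a sketch of that idea, not the approach used in SimpleCNN above; the names `features` and `n_features` are illustrative.

```python
import torch
import torch.nn as nn

# Convolutional/pooling part only, with the same hyperparameters as SimpleCNN
features = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2, 2),
)

# Run a single zero-filled 28x28 image through it and count features per sample
with torch.no_grad():
    dummy = torch.zeros(1, 1, 28, 28)
    n_features = features(dummy).flatten(start_dim=1).shape[1]

print(n_features)  # 1568, i.e. 32 * 7 * 7

# Use the measured size for the first fully connected layer
fc1 = nn.Linear(n_features, 128)
```

This keeps the fully connected layer consistent if the input resolution or the convolutional settings change later.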
