Alright, let's build your first basic autoencoder. This practical exercise will help solidify your understanding of the encoder, decoder, bottleneck, and the reconstruction loss we've been discussing. We'll use the popular MNIST dataset, which consists of grayscale images of handwritten digits. Our goal is to train an autoencoder to compress these images into a lower-dimensional representation and then reconstruct them.
Before we begin, ensure you have your deep learning environment ready. For this example, we'll outline steps assuming a PyTorch setup. You'll primarily need torch and torchvision for data loading and model building, numpy for numerical operations, and matplotlib for visualizing our results.
1. Importing Libraries
First, let's import the necessary libraries.
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
import numpy as np
import matplotlib.pyplot as plt
2. Loading and Preparing the MNIST Dataset
The MNIST dataset is conveniently available through torchvision.datasets. Each image is 28x28 pixels. For this basic autoencoder, we'll flatten each 28x28 image into a vector of 784 pixels. We also need to normalize the pixel values, which helps with training stability. PyTorch's transforms.ToTensor() scales pixels to [0, 1] automatically; we then apply transforms.Normalize to shift them to [-1, 1], which pairs with the Tanh output activation we'll use in the decoder.
# Define a transform to normalize the data and flatten images
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),  # Shift [0, 1] pixels to [-1, 1] to match the Tanh output
    transforms.Lambda(lambda x: x.view(-1))  # Flatten the 28x28 image to 784
])
# Load the MNIST dataset
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)
# Create data loaders
batch_size = 256
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
# Get a sample to check shape (optional)
sample_data, _ = next(iter(train_loader))
print(f"Sample x_train batch shape: {sample_data.shape}")
print(f"Flattened image size: {sample_data.shape[1]}")
You'll notice we are only loading the image data and ignoring the labels (_). This is because autoencoders are trained in an unsupervised manner; their goal is to reconstruct the input, not to predict a label. The flattened image size should be 784.
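As an optional sanity check, you can confirm that the transform produced the expected value range. A small sketch reusing sample_data from above; after Normalize((0.5,), (0.5,)) the pixels should span roughly [-1, 1]:

# Verify the normalized pixel range of the sample batch
print(f"Min pixel value: {sample_data.min().item():.2f}")  # approx -1.0
print(f"Max pixel value: {sample_data.max().item():.2f}")  # approx 1.0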
An autoencoder consists of two main parts: an encoder and a decoder. The encoder maps the input data to a lower-dimensional representation (the bottleneck), and the decoder attempts to reconstruct the original input from this representation.
3. Defining the Architecture
Let's define a simple architecture using PyTorch's nn.Module. We'll use nn.Linear layers:
- Encoder: nn.Linear layers that progressively reduce the dimensionality. For example, 784 -> 128 -> 64, down to the bottleneck.
- Decoder: nn.Linear layers that progressively increase the dimensionality, mirroring the encoder. For example, 32 -> 64 -> 128 -> 784.
- Output activation: one that matches the input range (nn.Sigmoid for [0, 1], or nn.Tanh for [-1, 1] if you normalized to that range).

Here's how we can define it:
class Autoencoder(nn.Module):
    def __init__(self, latent_dim=32):
        super(Autoencoder, self).__init__()
        self.latent_dim = latent_dim
        # Encoder: 784 -> 128 -> 64 -> latent_dim
        self.encoder = nn.Sequential(
            nn.Linear(784, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, latent_dim),
            nn.ReLU()  # The bottleneck layer
        )
        # Decoder: latent_dim -> 64 -> 128 -> 784
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, 784),
            nn.Tanh()  # Use Tanh if input was normalized to [-1, 1], else Sigmoid for [0, 1]
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
latent_dim = 32
autoencoder = Autoencoder(latent_dim)
# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
autoencoder.to(device)
print(autoencoder)
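Printing the model shows the layer stack. If you also want to know how many trainable parameters it has, a one-liner does it (a small optional sketch; for this architecture the total comes to 222,384):

# Count trainable parameters
num_params = sum(p.numel() for p in autoencoder.parameters() if p.requires_grad)
print(f"Trainable parameters: {num_params}")  # 222384 for this architecture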
Below is a diagram illustrating the general structure of our autoencoder:

Figure: The flow of data through the autoencoder, from input, through compression in the encoder and bottleneck, to reconstruction by the decoder.
4. Defining Loss Function and Optimizer
Before training, we need to define the loss function and the optimizer. As discussed in the chapter, Mean Squared Error (MSE) is a common choice for reconstruction loss when dealing with continuous data like our normalized pixel values.
$$\text{MSE} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{x}_i)^2$$

Here, $x_i$ is the original input and $\hat{x}_i$ is the reconstructed output. We'll use the Adam optimizer, which is a popular and effective choice for many deep learning tasks.
criterion = nn.MSELoss()
optimizer = optim.Adam(autoencoder.parameters(), lr=1e-3)
Now, we train the autoencoder. The distinctive aspect here is that the input data serves as both the input and the target output. The network learns to reconstruct what it's given.
num_epochs = 50
train_losses = []
val_losses = []

for epoch in range(num_epochs):
    # Training phase
    autoencoder.train()
    running_train_loss = 0.0
    for data, _ in train_loader:
        data = data.to(device)
        optimizer.zero_grad()
        outputs = autoencoder(data)
        loss = criterion(outputs, data)  # Target is the input itself
        loss.backward()
        optimizer.step()
        running_train_loss += loss.item() * data.size(0)
    epoch_train_loss = running_train_loss / len(train_loader.dataset)
    train_losses.append(epoch_train_loss)

    # Validation phase
    autoencoder.eval()
    running_val_loss = 0.0
    with torch.no_grad():
        for data, _ in test_loader:
            data = data.to(device)
            outputs = autoencoder(data)
            loss = criterion(outputs, data)
            running_val_loss += loss.item() * data.size(0)
    epoch_val_loss = running_val_loss / len(test_loader.dataset)
    val_losses.append(epoch_val_loss)

    print(f'Epoch [{epoch+1}/{num_epochs}], '
          f'Train Loss: {epoch_train_loss:.4f}, '
          f'Validation Loss: {epoch_val_loss:.4f}')
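If you want to keep the trained weights for later use, you can save the model's state dict (a small optional sketch; the filename autoencoder_mnist.pth is an arbitrary choice):

# Save the trained weights; reload them later with load_state_dict
torch.save(autoencoder.state_dict(), 'autoencoder_mnist.pth')
# autoencoder.load_state_dict(torch.load('autoencoder_mnist.pth'))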
We can plot the training and validation loss to see how our model learned:
plt.figure(figsize=(10, 5))
plt.plot(train_losses, label='Training Loss')
plt.plot(val_losses, label='Validation Loss')
plt.title('Model Loss During Training')
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.legend()
plt.grid(True)
plt.show()
The true test of our autoencoder is how well it can reconstruct the input images. Let's use the trained autoencoder to predict (reconstruct) the images from our test set and display a few of them alongside the originals.
# Reconstruct images from the test set
autoencoder.eval()  # Set model to evaluation mode
with torch.no_grad():
    data_iter = iter(test_loader)
    data, _ = next(data_iter)  # Get a batch of test data
    data = data.to(device)
    decoded_imgs = autoencoder(data).cpu().numpy()  # Get reconstructions and move to CPU
# Display original and reconstructed images
n = 10  # Number of digits to display
plt.figure(figsize=(20, 4))
for i in range(n):
    # Display original
    ax = plt.subplot(2, n, i + 1)
    # Undo normalization for display: data was normalized to [-1, 1], so scale back to [0, 1]
    original_img = (data[i].cpu().numpy().reshape(28, 28) + 1) / 2
    plt.imshow(original_img, cmap='gray')
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    if i == 0:
        ax.set_title("Original")

    # Display reconstruction
    ax = plt.subplot(2, n, i + 1 + n)
    # Undo normalization for display: decoded_imgs are in [-1, 1], scale back to [0, 1]
    reconstructed_img = (decoded_imgs[i].reshape(28, 28) + 1) / 2
    plt.imshow(reconstructed_img, cmap='gray')
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    if i == 0:
        ax.set_title("Reconstructed")
plt.show()
You should see that the reconstructed digits, while perhaps a bit blurrier or less sharp than the originals, are generally recognizable. This indicates that our autoencoder has learned a meaningful, compressed representation in its 32-dimensional bottleneck layer and can use this representation to generate a reasonable approximation of the original 784-dimensional input.
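To complement the visual check with a number, you can compute the reconstruction error for the same test batch (a small sketch reusing data and autoencoder from above):

# Per-image mean squared reconstruction error for the test batch
with torch.no_grad():
    recon = autoencoder(data)
    per_image_mse = ((recon - data) ** 2).mean(dim=1)  # One value per image
print(f"Mean reconstruction MSE over the batch: {per_image_mse.mean().item():.4f}")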
In this hands-on session, you've successfully built and trained a basic autoencoder using PyTorch. You've seen how:
- The encoder compresses 784-dimensional inputs into a 32-dimensional bottleneck, and the decoder reconstructs the input from it.
- The input serves as its own target, so no labels are needed; training is unsupervised.
- The MSE reconstruction loss, minimized with the Adam optimizer, drives the network to keep the information needed to rebuild the input.
This simple autoencoder demonstrates the fundamental principles. The latent representation learned by the bottleneck is the foundation for feature extraction, which we will explore in much more detail in the upcoming chapters. For instance, you can get the latent representation by passing input data through autoencoder.encoder(data). We'll soon see how different types of autoencoders and more sophisticated architectures can learn even more powerful and useful features.
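A minimal sketch of that idea, reusing the test batch from the reconstruction step:

# Extract 32-dimensional latent codes for a batch of test images
autoencoder.eval()
with torch.no_grad():
    latent_codes = autoencoder.encoder(data)  # data: (batch_size, 784) tensor on device
print(latent_codes.shape)  # torch.Size([256, 32]) for a full batch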