Alright, let's build your first basic autoencoder. This practical exercise will help solidify your understanding of the encoder, decoder, bottleneck, and the reconstruction loss we've been discussing. We'll use the popular MNIST dataset, which consists of grayscale images of handwritten digits. Our goal is to train an autoencoder to compress these images into a lower-dimensional representation and then reconstruct them.
Before we begin, ensure you have your deep learning environment ready. For this example, we'll outline steps assuming a PyTorch setup. You'll primarily need torch and torchvision for data loading and model building, numpy for numerical operations, and matplotlib for visualizing our results.
1. Importing Libraries
First, let's import the necessary libraries.
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
import numpy as np
import matplotlib.pyplot as plt
2. Loading and Preparing the MNIST Dataset
The MNIST dataset is conveniently available through torchvision.datasets. Each image is 28x28 pixels. For this basic autoencoder, we'll flatten each 28x28 image into a vector of 784 pixels. We also need to normalize the pixel values, which helps with training stability. PyTorch's transforms.ToTensor() scales pixels to [0, 1] automatically; we then apply transforms.Normalize to shift them to [-1, 1], which pairs with the Tanh output activation we'll use in the decoder.
# Define a transform to normalize the data and flatten images
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),  # Shift [0, 1] pixels to [-1, 1] to match the Tanh output
    transforms.Lambda(lambda x: x.view(-1))  # Flatten the 28x28 image to 784
])
# Load the MNIST dataset
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)
# Create data loaders
batch_size = 256
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
# Get a sample to check shape (optional)
sample_data, _ = next(iter(train_loader))
print(f"Sample x_train batch shape: {sample_data.shape}")
print(f"Flattened image size: {sample_data.shape[1]}")
You'll notice we are only loading the image data and ignoring the labels (_). This is because autoencoders are trained in an unsupervised manner; their goal is to reconstruct the input, not to predict a label. The flattened image size should be 784.
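As an optional sanity check, you can confirm that the transform produced the expected value range. A small sketch reusing sample_data from above; after Normalize((0.5,), (0.5,)) the pixels should span roughly [-1, 1]:

# Verify the normalized pixel range of the sample batch
print(f"Min pixel value: {sample_data.min().item():.2f}")  # approx -1.0
print(f"Max pixel value: {sample_data.max().item():.2f}")  # approx 1.0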
An autoencoder consists of two main parts: an encoder and a decoder. The encoder maps the input data to a lower-dimensional representation (the bottleneck), and the decoder attempts to reconstruct the original input from this representation.
3. Defining the Architecture
Let's define a simple architecture using PyTorch's nn.Module. We'll use nn.Linear layers:
- Encoder: nn.Linear layers that progressively reduce the dimensionality. For example, 784 -> 128 -> 64, down to the bottleneck.
- Decoder: nn.Linear layers that progressively increase the dimensionality, mirroring the encoder. For example, 32 -> 64 -> 128 -> 784.
- Output activation: one that matches the input range (nn.Sigmoid for [0, 1], or nn.Tanh for [-1, 1] if you normalized to that range).

Here's how we can define it:
class Autoencoder(nn.Module):
    def __init__(self, latent_dim=32):
        super(Autoencoder, self).__init__()
        self.latent_dim = latent_dim
        # Encoder: 784 -> 128 -> 64 -> latent_dim
        self.encoder = nn.Sequential(
            nn.Linear(784, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, latent_dim),
            nn.ReLU()  # The bottleneck layer
        )
        # Decoder: latent_dim -> 64 -> 128 -> 784
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, 784),
            nn.Tanh()  # Use Tanh if input was normalized to [-1, 1], else Sigmoid for [0, 1]
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
latent_dim = 32
autoencoder = Autoencoder(latent_dim)
# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
autoencoder.to(device)
print(autoencoder)
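Printing the model shows the layer stack. If you also want to know how many trainable parameters it has, a one-liner does it (a small optional sketch; for this architecture the total comes to 222,384):

# Count trainable parameters
num_params = sum(p.numel() for p in autoencoder.parameters() if p.requires_grad)
print(f"Trainable parameters: {num_params}")  # 222384 for this architecture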
Below is a diagram illustrating the general structure of our autoencoder:

Figure: The flow of data through the autoencoder, from input, through compression in the encoder and bottleneck, to reconstruction by the decoder.
4. Defining Loss Function and Optimizer
Before training, we need to define the loss function and the optimizer. As discussed in the chapter, Mean Squared Error (MSE) is a common choice for reconstruction loss when dealing with continuous data like our normalized pixel values.
$$\text{MSE} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{x}_i)^2$$

Here, $x_i$ is the original input and $\hat{x}_i$ is the reconstructed output. We'll use the Adam optimizer, which is a popular and effective choice for many deep learning tasks.
criterion = nn.MSELoss()
optimizer = optim.Adam(autoencoder.parameters(), lr=1e-3)
Now, we train the autoencoder. The distinctive aspect here is that the input data serves as both the input and the target output. The network learns to reconstruct what it's given.
num_epochs = 50
train_losses = []
val_losses = []

for epoch in range(num_epochs):
    # Training phase
    autoencoder.train()
    running_train_loss = 0.0
    for data, _ in train_loader:
        data = data.to(device)
        optimizer.zero_grad()
        outputs = autoencoder(data)
        loss = criterion(outputs, data)  # Target is the input itself
        loss.backward()
        optimizer.step()
        running_train_loss += loss.item() * data.size(0)
    epoch_train_loss = running_train_loss / len(train_loader.dataset)
    train_losses.append(epoch_train_loss)

    # Validation phase
    autoencoder.eval()
    running_val_loss = 0.0
    with torch.no_grad():
        for data, _ in test_loader:
            data = data.to(device)
            outputs = autoencoder(data)
            loss = criterion(outputs, data)
            running_val_loss += loss.item() * data.size(0)
    epoch_val_loss = running_val_loss / len(test_loader.dataset)
    val_losses.append(epoch_val_loss)

    print(f'Epoch [{epoch+1}/{num_epochs}], '
          f'Train Loss: {epoch_train_loss:.4f}, '
          f'Validation Loss: {epoch_val_loss:.4f}')
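If you want to keep the trained weights for later use, you can save the model's state dict (a small optional sketch; the filename autoencoder_mnist.pth is an arbitrary choice):

# Save the trained weights; reload them later with load_state_dict
torch.save(autoencoder.state_dict(), 'autoencoder_mnist.pth')
# autoencoder.load_state_dict(torch.load('autoencoder_mnist.pth'))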
We can plot the training and validation loss to see how our model learned:
plt.figure(figsize=(10, 5))
plt.plot(train_losses, label='Training Loss')
plt.plot(val_losses, label='Validation Loss')
plt.title('Model Loss During Training')
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.legend()
plt.grid(True)
plt.show()
The true test of our autoencoder is how well it can reconstruct the input images. Let's use the trained autoencoder to predict (reconstruct) the images from our test set and display a few of them alongside the originals.
# Reconstruct images from the test set
autoencoder.eval()  # Set model to evaluation mode
with torch.no_grad():
    data_iter = iter(test_loader)
    data, _ = next(data_iter)  # Get a batch of test data
    data = data.to(device)
    decoded_imgs = autoencoder(data).cpu().numpy()  # Get reconstructions and move to CPU
# Display original and reconstructed images
n = 10  # Number of digits to display
plt.figure(figsize=(20, 4))
for i in range(n):
    # Display original
    ax = plt.subplot(2, n, i + 1)
    # Undo normalization for display: data was normalized to [-1, 1], so scale back to [0, 1]
    original_img = (data[i].cpu().numpy().reshape(28, 28) + 1) / 2
    plt.imshow(original_img, cmap='gray')
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    if i == 0:
        ax.set_title("Original")

    # Display reconstruction
    ax = plt.subplot(2, n, i + 1 + n)
    # Undo normalization for display: decoded_imgs are in [-1, 1], scale back to [0, 1]
    reconstructed_img = (decoded_imgs[i].reshape(28, 28) + 1) / 2
    plt.imshow(reconstructed_img, cmap='gray')
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    if i == 0:
        ax.set_title("Reconstructed")
plt.show()
You should see that the reconstructed digits, while perhaps a bit blurrier or less sharp than the originals, are generally recognizable. This indicates that our autoencoder has learned a meaningful, compressed representation in its 32-dimensional bottleneck layer and can use this representation to generate a reasonable approximation of the original 784-dimensional input.
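To complement the visual check with a number, you can compute the reconstruction error for the same test batch (a small sketch reusing data and autoencoder from above):

# Per-image mean squared reconstruction error for the test batch
with torch.no_grad():
    recon = autoencoder(data)
    per_image_mse = ((recon - data) ** 2).mean(dim=1)  # One value per image
print(f"Mean reconstruction MSE over the batch: {per_image_mse.mean().item():.4f}")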
In this hands-on session, you've successfully built and trained a basic autoencoder using PyTorch. You've seen how:
- The encoder compresses 784-dimensional inputs into a 32-dimensional bottleneck, and the decoder reconstructs the input from it.
- The input serves as its own target, so no labels are needed; training is unsupervised.
- The MSE reconstruction loss, minimized with the Adam optimizer, drives the network to keep the information needed to rebuild the input.
This simple autoencoder demonstrates the fundamental principles. The latent representation learned by the bottleneck is the foundation for feature extraction, which we will explore in much more detail in the upcoming chapters. For instance, you can get the latent representation by passing input data through autoencoder.encoder(data). We'll soon see how different types of autoencoders and more sophisticated architectures can learn even more powerful and useful features.
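A minimal sketch of that idea, reusing the test batch from the reconstruction step:

# Extract 32-dimensional latent codes for a batch of test images
autoencoder.eval()
with torch.no_grad():
    latent_codes = autoencoder.encoder(data)  # data: (batch_size, 784) tensor on device
print(latent_codes.shape)  # torch.Size([256, 32]) for a full batch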