Alright, let's put theory into practice. In the preceding sections, we've discussed how Convolutional Autoencoders (ConvAEs) are well-suited for image data, leveraging convolutional and pooling layers to respect spatial hierarchies. Now, you'll build one yourself to extract features from images. We'll use the popular MNIST dataset, which consists of grayscale images of handwritten digits. This practical exercise will solidify your understanding of ConvAE architecture and its application in feature learning.
Our goal is to train a ConvAE to reconstruct MNIST images and then use its encoder part to transform these images into a lower-dimensional feature representation.
First, ensure you have PyTorch and Torchvision installed. If you've been following along with the course, your environment should be ready. We'll also use NumPy for numerical operations and Matplotlib or Plotly for visualizations. For the embedded visualizations here, we'll prepare Plotly JSON.
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor, Resize
# For visualizations if you run this in a notebook:
# import matplotlib.pyplot as plt
# For t-SNE:
# from sklearn.manifold import TSNE
MNIST images are 28x28 pixels. For convolutional layers in PyTorch, we need the format (Channels, Height, Width). We'll also normalize the pixel values to the range [0, 1], which is good practice for training neural networks. The torchvision library makes this easy.
# Load MNIST dataset and apply transformations
transform = ToTensor() # Converts images to PyTorch tensors and normalizes to [0, 1]
train_dataset = MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = MNIST(root='./data', train=False, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False)
# Get a sample to check the shape
sample_data, _ = next(iter(train_loader))
print(f"Sample batch shape: {sample_data.shape}")
You should see output like:
Sample batch shape: torch.Size([128, 1, 28, 28])
The encoder's job is to compress the input image into a compact latent representation. It typically consists of a series of nn.Conv2d layers (to learn features) followed by nn.MaxPool2d layers (to downsample and reduce dimensionality).
Let's define an encoder that maps the 1x28x28 input image to a latent vector of 64 dimensions.
latent_dim = 64  # Dimensionality of the latent space

class Encoder(nn.Module):
    def __init__(self):
        super(Encoder, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)   # -> 16x28x28
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)        # -> 16x14x14
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)  # -> 32x14x14
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)        # -> 32x7x7
        self.flatten = nn.Flatten()
        # The flattened size is 32 * 7 * 7 = 1568
        self.fc = nn.Linear(32 * 7 * 7, latent_dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.conv1(x))
        x = self.pool1(x)
        x = self.relu(self.conv2(x))
        x = self.pool2(x)
        x = self.flatten(x)
        x = self.relu(self.fc(x))
        return x

# Instantiate and print the encoder
encoder = Encoder()
print(encoder)
Printing the encoder will show the architecture and layers. Notice how the spatial dimensions decrease while the number of filters (features) increases, capturing more complex patterns before being condensed into the latent_dim vector. Using padding=1 with kernel_size=3 ensures that each convolution's output feature map has the same spatial dimensions as its input (before pooling), making architecture design a bit more straightforward.
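To confirm these shape annotations, you can pass a dummy batch through the encoder. This is an optional sanity check, not part of the training pipeline:
# Quick sanity check of the encoder's output shape (optional)
dummy_batch = torch.randn(4, 1, 28, 28)  # 4 fake grayscale 28x28 images
with torch.no_grad():
    latent = encoder(dummy_batch)
print(latent.shape)  # Expected: torch.Size([4, 64])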
The decoder's task is the opposite of the encoder: to reconstruct the original image from the latent representation. It often mirrors the encoder's architecture but uses nn.ConvTranspose2d layers to increase the spatial dimensions.
class Decoder(nn.Module):
    def __init__(self):
        super(Decoder, self).__init__()
        # Dense layer to upscale from latent dim to the pre-flattened size
        self.fc = nn.Linear(latent_dim, 32 * 7 * 7)
        # Reshape will be done in the forward pass using .view()
        self.convT1 = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)  # -> 16x14x14
        self.convT2 = nn.ConvTranspose2d(16, 1, kernel_size=2, stride=2)   # -> 1x28x28
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.relu(self.fc(x))
        x = x.view(-1, 32, 7, 7)          # Reshape to 32x7x7
        x = self.relu(self.convT1(x))
        x = self.sigmoid(self.convT2(x))  # Sigmoid for [0,1] pixel values
        return x

# Instantiate and print the decoder
decoder = Decoder()
print(decoder)
The nn.ConvTranspose2d layers with stride=2 effectively double the spatial dimensions at each step. The final layer uses a sigmoid activation because our input images were normalized to be between 0 and 1.
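As with the encoder, you can verify the upsampling path with a dummy latent vector. Again, this is only an optional check:
# Quick sanity check of the decoder's output shape (optional)
dummy_latent = torch.randn(4, latent_dim)
with torch.no_grad():
    reconstruction = decoder(dummy_latent)
print(reconstruction.shape)  # Expected: torch.Size([4, 1, 28, 28])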
Now, we combine the encoder and decoder into a full autoencoder model. In PyTorch, this is another nn.Module that calls the encoder and decoder in sequence. We also define our loss function and optimizer.
class Autoencoder(nn.Module):
    def __init__(self, encoder, decoder):
        super(Autoencoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

autoencoder = Autoencoder(encoder, decoder)
print(autoencoder)

# Define the loss function and optimizer
criterion = nn.BCELoss()  # Binary Cross-Entropy Loss for pixel-wise comparison
optimizer = optim.Adam(autoencoder.parameters(), lr=1e-3)
We use BCELoss as the loss function, which is suitable for comparing pixel values that are between 0 and 1 (due to the sigmoid activation in the decoder's last layer). The Adam optimizer is a common and effective choice.
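As an optional sanity check before training, you can compute the reconstruction loss of the untrained model on one batch. Note how the loss compares the model's output against the input images themselves:
# Optional: reconstruction loss of the untrained model on one batch
with torch.no_grad():
    sample_batch, _ = next(iter(train_loader))
    initial_loss = criterion(autoencoder(sample_batch), sample_batch)
print(f"Initial reconstruction loss: {initial_loss.item():.4f}")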
A diagram can help visualize this architecture:
The Convolutional Autoencoder architecture. The encoder maps the input image to a low-dimensional latent vector, and the decoder attempts to reconstruct the original image from this vector.
With the model, loss, and optimizer defined, we can write our training loop. The autoencoder learns to reconstruct its input, so the input images serve as both the input and the target.
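The exact training configuration can vary; the following is a minimal sketch in which the epoch count and device setup are assumptions. The device variable defined here is also used by the evaluation code below.
# Minimal training loop sketch (epoch count and device setup are assumptions)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
autoencoder.to(device)

num_epochs = 10
for epoch in range(num_epochs):
    autoencoder.train()
    running_loss = 0.0
    for imgs, _ in train_loader:         # Labels are ignored; the image is also the target
        imgs = imgs.to(device)
        optimizer.zero_grad()
        outputs = autoencoder(imgs)
        loss = criterion(outputs, imgs)  # Compare reconstruction to the original input
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * imgs.size(0)
    epoch_loss = running_loss / len(train_loader.dataset)
    print(f"Epoch {epoch + 1}/{num_epochs}, loss: {epoch_loss:.4f}")
Once training finishes, we can compare a few original test images with their reconstructions: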
import torch
import numpy as np
import matplotlib.pyplot as plt
from torchvision.transforms import Resize

# Assuming autoencoder, test_loader, and device are already defined from your previous setup

# Predict on test images
autoencoder.eval()  # Set model to evaluation mode
reconstructed_imgs = []
original_imgs = []

# Define a transformation to downsample images to 16x16 pixels for a compact display
resize_transform = Resize((16, 16))

with torch.no_grad():
    for i, data in enumerate(test_loader):
        imgs, _ = data
        imgs = imgs.to(device)
        outputs = autoencoder(imgs)
        # Keep only the first batch for display
        if i == 0:
            # Resize original and reconstructed images to 16x16
            original_imgs = resize_transform(imgs).cpu().numpy()
            reconstructed_imgs = resize_transform(outputs).cpu().numpy()
            break

# Prepare for displaying with Matplotlib
n_display = 5
fig, axes = plt.subplots(2, n_display, figsize=(n_display * 2, 4))  # Adjust figsize as needed

for i in range(n_display):
    # Original image
    axes[0, i].imshow(original_imgs[i, 0], cmap='Greys')
    axes[0, i].axis('off')  # Turn off axis
    if i == 0:
        axes[0, i].set_title("Original Images", fontsize=12)

    # Reconstructed image
    axes[1, i].imshow(reconstructed_imgs[i, 0], cmap='Greys')
    axes[1, i].axis('off')  # Turn off axis
    if i == 0:
        axes[1, i].set_title("Reconstructed Images", fontsize=12)

plt.tight_layout(rect=[0, 0, 1, 0.95])  # Adjust layout to make space for titles
plt.suptitle("Original vs. Reconstructed Images", fontsize=16, y=1.0)  # Overall title
plt.show()
Comparison of original MNIST test images (top row) and their reconstructions by the ConvAE (bottom row). The reconstructions should be recognizable, though perhaps a bit blurrier than the originals.
The quality of reconstructions depends on the model architecture, latent dimension size, and training duration. More complex models or longer training might yield sharper images.
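One simple way to quantify reconstruction quality is the average loss over the full test set. The snippet below is a sketch that reuses the criterion and device defined above:
# Optional: average reconstruction loss on the test set
autoencoder.eval()
total_loss, total_images = 0.0, 0
with torch.no_grad():
    for imgs, _ in test_loader:
        imgs = imgs.to(device)
        outputs = autoencoder(imgs)
        total_loss += criterion(outputs, imgs).item() * imgs.size(0)
        total_images += imgs.size(0)
print(f"Average test reconstruction loss: {total_loss / total_images:.4f}")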
The primary goal of this exercise is feature extraction. The encoder part of our trained autoencoder can now be used to transform input images into their latent_dim-dimensional feature vectors.
# Use the trained encoder to get latent representations (features)
encoder.eval()  # Set encoder to evaluation mode

all_features = []
all_labels = []

# For the full dataset, iterate through the loaders
full_train_loader = DataLoader(train_dataset, batch_size=1024)
full_test_loader = DataLoader(test_dataset, batch_size=1024)

with torch.no_grad():
    for data in full_train_loader:  # Using a larger batch size for inference
        imgs, labels = data
        imgs = imgs.to(device)
        features = encoder(imgs)
        all_features.append(features.cpu().numpy())
        all_labels.append(labels.numpy())

encoded_features_train = np.concatenate(all_features, axis=0)
y_train = np.concatenate(all_labels, axis=0)

# Repeat for the test set
all_features = []
all_labels = []

with torch.no_grad():
    for data in full_test_loader:
        imgs, labels = data
        imgs = imgs.to(device)
        features = encoder(imgs)
        all_features.append(features.cpu().numpy())
        all_labels.append(labels.numpy())

encoded_features_test = np.concatenate(all_features, axis=0)
y_test = np.concatenate(all_labels, axis=0)

print(f"Shape of training features: {encoded_features_train.shape}")
print(f"Shape of test features: {encoded_features_test.shape}")
This will output:
Shape of training features: (60000, 64)
Shape of test features: (10000, 64)
Each image is now represented by a vector of 64 numbers. These features are learned by the autoencoder to capture the essential information needed to reconstruct the original image. They are often more semantically meaningful than the raw pixel values and can be used for downstream tasks like classification or clustering.
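As a quick illustration of such a downstream task (a sketch only; Chapter 7 covers classification in depth), you could fit a simple scikit-learn classifier on the 64-dimensional features:
# Optional: a simple classifier trained on the learned features (illustrative sketch)
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(encoded_features_train, y_train)
accuracy = clf.score(encoded_features_test, y_test)
print(f"Test accuracy using ConvAE features: {accuracy:.3f}")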
To get a sense of how the autoencoder has organized the data in its latent space, we can use a dimensionality reduction technique like t-SNE to project the 64-dimensional features down to 2 dimensions and then plot them, coloring by the original digit labels.
# The following code uses scikit-learn for t-SNE and would be run in a Python environment.
# It might be computationally intensive for the full dataset, so a subset is often used for visualization.
# from sklearn.manifold import TSNE
# import plotly.express as px
#
# # Use a subset of test features for t-SNE (e.g., first 5000 samples)
# num_samples_tsne = 5000
# tsne = TSNE(n_components=2, random_state=42, perplexity=30, n_iter=300)
# latent_2d = tsne.fit_transform(encoded_features_test[:num_samples_tsne])
#
# # Create a Plotly scatter plot
# fig = px.scatter(x=latent_2d[:, 0], y=latent_2d[:, 1], color=y_test[:num_samples_tsne].astype(str),
#                  labels={'color': 'Digit'}, title="t-SNE visualization of MNIST latent space (ConvAE features)")
# fig.show()  # In a notebook
# For static display, here's an example Plotly JSON structure for a t-SNE plot.
# This would be populated with actual t-SNE results.
tsne_plot_data = []

# Placeholder: manually create a few sample points for 10 classes as Plotly JSON traces.
# This is illustrative; real t-SNE would generate these points.
sample_points = {
    0: [[-5, -5], [-5.5, -4.5]], 1: [[5, 5], [5.5, 4.5]], 2: [[-5, 5], [-4.5, 5.5]],
    3: [[5, -5], [4.5, -5.5]], 4: [[0, 0], [0.5, 0.5]], 5: [[-2, -2], [-1.5, -2.5]],
    6: [[2, 2], [1.5, 2.5]], 7: [[-2, 2], [-2.5, 1.5]], 8: [[2, -2], [2.5, -1.5]],
    9: [[0, 3], [0, 3.5]]
}
colors_map_plotly = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
                     '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']

for digit_class in range(10):
    points = np.array(sample_points.get(digit_class, []))
    if points.shape[0] > 0:
        tsne_plot_data.append({
            "type": "scatter",
            "mode": "markers",
            "x": points[:, 0].tolist(),
            "y": points[:, 1].tolist(),
            "name": f"Digit {digit_class}",
            "marker": {"color": colors_map_plotly[digit_class], "size": 8}
        })
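To turn these traces into a complete Plotly figure specification, wrap them in a figure dictionary with a layout; the layout values below are just one reasonable choice.
import json

# Assemble the traces into a full Plotly figure specification (layout values are illustrative)
tsne_figure = {
    "data": tsne_plot_data,
    "layout": {
        "title": "t-SNE of ConvAE latent features (illustrative)",
        "xaxis": {"title": "t-SNE dimension 1"},
        "yaxis": {"title": "t-SNE dimension 2"}
    }
}
tsne_plot_json = json.dumps(tsne_figure)  # JSON string ready for embedding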
An illustrative t-SNE visualization of the learned latent features from the ConvAE. Ideally, points corresponding to the same digit would cluster together, and different digits would form distinct (or somewhat separated) clusters.
If the autoencoder has learned well, you should see some separation between clusters of different digits. This indicates that the latent features capture discriminative information about the digit classes, even though the autoencoder was trained purely on reconstruction without any label information.
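If you ran the t-SNE code above, one rough way to quantify that separation is a silhouette score on the 2-D embedding, grouped by the true digit labels (higher values indicate tighter, better-separated clusters). This sketch assumes latent_2d and num_samples_tsne from the t-SNE snippet:
# Optional: rough measure of cluster separation in the 2-D t-SNE embedding
# (assumes latent_2d and num_samples_tsne from the t-SNE snippet above)
from sklearn.metrics import silhouette_score

score = silhouette_score(latent_2d, y_test[:num_samples_tsne])
print(f"Silhouette score of t-SNE embedding by digit label: {score:.3f}")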
In this session, you've successfully loaded and preprocessed the MNIST dataset with torchvision, built a convolutional encoder and decoder, combined and trained them as a ConvAE on a reconstruction objective, and used the trained encoder to extract compact feature vectors. These extracted features, encoded_features_train and encoded_features_test, are now ready to be used in various downstream machine learning tasks, such as classification, which we will explore further in Chapter 7. This exercise demonstrates the power of ConvAEs for learning compact and useful representations from image data.