Having explored the theoretical underpinnings of Convolutional Autoencoders (CAEs) earlier in this chapter, we now turn to their practical implementation. CAEs are particularly effective for data with spatial hierarchies, such as images, because they leverage convolutional layers to capture local patterns and pooling layers to create spatially invariant representations. This hands-on section will guide you through building and training a CAE for image reconstruction using a standard dataset. We will use the Keras API with a TensorFlow backend for this example, assuming you have a working deep learning environment set up.
First, ensure you have TensorFlow installed. We'll use the Fashion-MNIST dataset, a slightly more complex alternative to MNIST, consisting of 28x28 grayscale images of clothing items.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Load the Fashion-MNIST dataset
(x_train, _), (x_test, _) = keras.datasets.fashion_mnist.load_data()
# Normalize pixel values to [0, 1] and reshape for Conv2D input
def preprocess_data(x):
    # Scale pixel values to [0, 1]
    x = x.astype('float32') / 255.0
    # Add channel dimension (grayscale)
    x = np.reshape(x, (len(x), 28, 28, 1))
    return x
x_train = preprocess_data(x_train)
x_test = preprocess_data(x_test)
print(f"x_train shape: {x_train.shape}")
print(f"x_test shape: {x_test.shape}")
Normalizing the data to the [0, 1] range is standard practice and often works well with loss functions like Binary Crossentropy or Mean Squared Error for reconstruction tasks. Adding the channel dimension is necessary because Conv2D layers in TensorFlow/Keras expect input tensors with shape (batch_size, height, width, channels).
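As a quick, optional sanity check, the arrays should now match the standard Fashion-MNIST split of 60,000 training and 10,000 test images; a minimal sketch of such a check (purely illustrative, not part of the original pipeline) is:
# Optional sanity check of shapes and value range after preprocessing
assert x_train.shape == (60000, 28, 28, 1)
assert x_test.shape == (10000, 28, 28, 1)
assert x_train.min() >= 0.0 and x_train.max() <= 1.0  # pixels scaled into [0, 1]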
A CAE typically consists of an encoder that maps the input image to a compressed latent representation and a decoder that reconstructs the image from this representation.
Encoder: The encoder uses Conv2D layers, often paired with ReLU activations, to extract features. Spatial dimensions (height and width) are reduced either with MaxPooling2D layers or, as in the model below, with strided convolutions, while the growing number of learned filters increases representational capacity. We progressively decrease spatial resolution and increase the number of filters.
Decoder: The decoder mirrors the encoder structure but uses upsampling layers to increase spatial resolution and decrease the number of filters. Conv2DTranspose layers are commonly used for upsampling; they perform a transposed convolution, effectively learning how to reverse the downsampling performed by the encoder. Alternatively, one can use UpSampling2D followed by a Conv2D layer (a short sketch of this variant appears below). The final layer typically uses a sigmoid activation to output pixel values in the [0, 1] range, matching our normalized input data.
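To make the alternative upsampling path concrete, here is a minimal sketch of a decoder block built from UpSampling2D plus Conv2D layers. It is not used in the model below; the names (alt_decoder, x_alt) and the bottleneck depth of 32 are assumptions chosen to mirror the architecture that follows:
# Sketch only: an alternative decoder using UpSampling2D + Conv2D (not used in the model below)
alt_decoder_inputs = keras.Input(shape=(7, 7, 32))  # 32 matches the latent_dim defined below
x_alt = layers.UpSampling2D((2, 2))(alt_decoder_inputs)                      # 7x7 -> 14x14
x_alt = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(x_alt)
x_alt = layers.UpSampling2D((2, 2))(x_alt)                                   # 14x14 -> 28x28
x_alt = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(x_alt)
alt_decoder_outputs = layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x_alt)
alt_decoder = keras.Model(alt_decoder_inputs, alt_decoder_outputs, name="alt_decoder")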
Let's define a simple CAE model:
latent_dim = 32 # Dimension of the bottleneck layer
# --- Encoder ---
encoder_inputs = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(32, (3, 3), activation='relu', padding='same', strides=2)(encoder_inputs) # 28x28 -> 14x14
x = layers.Conv2D(64, (3, 3), activation='relu', padding='same', strides=2)(x) # 14x14 -> 7x7
# Note the feature map shape at this point: (7, 7, 64); the decoder input defined below must match the bottleneck's output shape
shape_before_bottleneck = x.shape[1:]  # kept for reference when sizing the decoder
# We can use a Conv2D layer as the bottleneck directly or flatten and use Dense
# Here, using Conv2D to maintain spatial structure info as much as possible
encoder_output = layers.Conv2D(latent_dim, (3, 3), activation='relu', padding='same')(x) # Bottleneck: 7x7xlatent_dim
encoder = keras.Model(encoder_inputs, encoder_output, name="encoder")
# encoder.summary() # Optional: view encoder structure
# --- Decoder ---
decoder_inputs = keras.Input(shape=(7, 7, latent_dim)) # Input shape matches encoder output
x = layers.Conv2DTranspose(64, (3, 3), activation='relu', padding='same', strides=2)(decoder_inputs) # 7x7 -> 14x14
x = layers.Conv2DTranspose(32, (3, 3), activation='relu', padding='same', strides=2)(x) # 14x14 -> 28x28
decoder_outputs = layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x) # 28x28 -> 28x28x1
decoder = keras.Model(decoder_inputs, decoder_outputs, name="decoder")
# decoder.summary() # Optional: view decoder structure
# --- Autoencoder (Encoder + Decoder) ---
autoencoder_input = keras.Input(shape=(28, 28, 1))
encoded = encoder(autoencoder_input)
decoded = decoder(encoded)
autoencoder = keras.Model(autoencoder_input, decoded, name="autoencoder")
autoencoder.summary()
Note the use of padding='same', which keeps the spatial dimensions predictable after each convolution and makes the architecture slightly simpler to design. The strides=2 arguments in the Conv2D and Conv2DTranspose layers handle the downsampling and upsampling, respectively. The bottleneck here is a small feature map of shape 7×7×32.
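One way to confirm these shapes programmatically is to query the model objects directly; Keras reports the batch dimension as None:
# Inspect tensor shapes at the model boundaries
print(encoder.output_shape)  # (None, 7, 7, 32)  -> the bottleneck feature map
print(decoder.output_shape)  # (None, 28, 28, 1) -> the reconstructed image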
Here's a visualization of the combined autoencoder architecture:
The diagram shows the flow from input image through the downsampling convolutional layers of the encoder to the bottleneck, followed by the upsampling transpose convolutional layers of the decoder to the reconstructed output image.
We need to compile the autoencoder model, specifying an optimizer and a loss function. Since our pixel values are normalized to [0, 1], the reconstruction can be treated as a per-pixel prediction in that range, so Binary Crossentropy (binary_crossentropy) is a suitable loss function. Mean Squared Error (mse) is also a common choice for image reconstruction. We'll use the Adam optimizer.
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
# Train the autoencoder
epochs = 20 # Adjust as needed
batch_size = 128
history = autoencoder.fit(x_train, x_train,  # Input and target are the same
                          epochs=epochs,
                          batch_size=batch_size,
                          shuffle=True,
                          validation_data=(x_test, x_test))  # Use test set for validation
During training, the model learns to minimize the difference between the input images (x_train) and the images reconstructed by passing them through the encoder and then the decoder. Monitoring the validation loss (val_loss) helps check for overfitting.
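If the validation loss plateaus, one optional refinement is to stop training automatically with Keras's built-in EarlyStopping callback. The sketch below is illustrative; the patience value of 3 is just an example choice, and the call would replace the fit call shown above:
# Optional: stop training once val_loss has not improved for a few epochs
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3,
                                           restore_best_weights=True)
history = autoencoder.fit(x_train, x_train,
                          epochs=epochs,
                          batch_size=batch_size,
                          shuffle=True,
                          validation_data=(x_test, x_test),
                          callbacks=[early_stop])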
Let's visualize the training progress:
# Plot training & validation loss values
plt.figure(figsize=(10, 5))
plt.plot(history.history['loss'], label='Train Loss', color='#1c7ed6')
plt.plot(history.history['val_loss'], label='Validation Loss', color='#fd7e14')
plt.title('Model Loss During Training')
plt.ylabel('Loss (Binary Crossentropy)')
plt.xlabel('Epoch')
plt.legend(loc='upper right')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
Training and validation loss curves for the Convolutional Autoencoder. Ideally, both curves decrease and converge.
The primary way to evaluate an autoencoder for reconstruction is to visually inspect its output on unseen data (the test set).
# Use the trained autoencoder to reconstruct images from the test set
reconstructed_imgs = autoencoder.predict(x_test)
# Display original and reconstructed images
n = 10 # Number of images to display
plt.figure(figsize=(20, 4))
for i in range(n):
    # Display original
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(x_test[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    if i == 0:
        ax.set_title("Original")

    # Display reconstruction
    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(reconstructed_imgs[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    if i == 0:
        ax.set_title("Reconstructed")
plt.suptitle('Original vs. Reconstructed Images (Test Set)')
plt.show()
You should observe that the reconstructed images are somewhat blurry versions of the originals, capturing the main features but losing some fine details. This is expected, as the bottleneck layer forces the model to learn a compressed representation. The quality of reconstruction depends heavily on the model architecture, the size of the latent dimension, the dataset complexity, and the training duration.
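Visual inspection can be complemented by a simple quantitative summary. As one illustrative option (not the only possible metric), the per-image mean squared error over the test set can be computed directly with NumPy:
# Average per-pixel squared error for each test image (lower is better)
per_image_mse = np.mean((x_test - reconstructed_imgs) ** 2, axis=(1, 2, 3))
print(f"Mean reconstruction MSE over the test set: {per_image_mse.mean():.5f}")
print(f"Worst reconstructed image MSE: {per_image_mse.max():.5f}")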
This example provides a starting point for implementing CAEs. You can experiment further by:
- Using UpSampling2D followed by Conv2D instead of Conv2DTranspose in the decoder, as sketched earlier in this section.
- Changing latent_dim to see how it affects reconstruction quality and the level of compression.
- Adding noise to the training inputs (for example, an x_train_noisy array) while keeping the clean images (x_train) as the target, turning the model into a denoising autoencoder. This often leads to more robust feature extraction; a sketch of preparing such noisy inputs appears at the end of this section.
- Using the trained encoder part of the model to extract compressed features from images for downstream tasks like classification.

By building and experimenting with CAEs, you gain practical experience in designing architectures tailored for spatial data and understanding the trade-offs involved in representation learning through compression.
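As a final illustration of the denoising variant listed above, here is a minimal sketch of how noisy training inputs could be prepared. The noise_factor value and the x_train_noisy / x_test_noisy names are assumptions for illustration, not part of the model defined earlier:
# Sketch: create noisy copies of the data for a denoising autoencoder
noise_factor = 0.3  # illustrative noise strength, worth tuning
x_train_noisy = np.clip(x_train + noise_factor * np.random.normal(size=x_train.shape), 0.0, 1.0)
x_test_noisy = np.clip(x_test + noise_factor * np.random.normal(size=x_test.shape), 0.0, 1.0)
# Train on noisy inputs while reconstructing the clean targets
# autoencoder.fit(x_train_noisy, x_train, epochs=epochs, batch_size=batch_size,
#                 shuffle=True, validation_data=(x_test_noisy, x_test))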