Having explored the theoretical underpinnings of Convolutional Autoencoders (CAEs) earlier in this chapter, we now turn to their practical implementation. CAEs are particularly effective for data with spatial hierarchies, such as images, because they leverage convolutional layers to capture local patterns and pooling layers to create spatially invariant representations. This hands-on section will guide you through building and training a CAE for image reconstruction using a standard dataset. We will use the Keras API with a TensorFlow backend for this example, assuming you have a working deep learning environment set up.
First, ensure you have TensorFlow installed. We'll use the Fashion-MNIST dataset, a slightly more complex alternative to MNIST, consisting of 28x28 grayscale images of clothing items.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Load the Fashion-MNIST dataset
(x_train, _), (x_test, _) = keras.datasets.fashion_mnist.load_data()
# Normalize pixel values to [0, 1] and reshape for Conv2D input
def preprocess_data(x):
    # Scale pixel values to [0, 1]
    x = x.astype('float32') / 255.0
    # Add channel dimension (grayscale)
    x = np.reshape(x, (len(x), 28, 28, 1))
    return x
x_train = preprocess_data(x_train)
x_test = preprocess_data(x_test)
print(f"x_train shape: {x_train.shape}")
print(f"x_test shape: {x_test.shape}")
Normalizing the data to the [0, 1] range is standard practice and often works well with loss functions like Binary Crossentropy or Mean Squared Error for reconstruction tasks. Adding the channel dimension is necessary because Conv2D layers in TensorFlow/Keras expect input tensors with shape (batch_size, height, width, channels).
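As a quick, optional sanity check, the arrays should now match the standard Fashion-MNIST split of 60,000 training and 10,000 test images; a minimal sketch of such a check (purely illustrative, not part of the original pipeline) is:
# Optional sanity check of shapes and value range after preprocessing
assert x_train.shape == (60000, 28, 28, 1)
assert x_test.shape == (10000, 28, 28, 1)
assert x_train.min() >= 0.0 and x_train.max() <= 1.0  # pixels scaled into [0, 1]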
A CAE typically consists of an encoder that maps the input image to a compressed latent representation and a decoder that reconstructs the image from this representation.
Encoder: The encoder uses Conv2D layers, often paired with ReLU activations, to extract features. Spatial dimensions (height and width) are reduced either with MaxPooling2D layers or, as in the model below, with strided convolutions, while the growing number of learned filters increases representational capacity. We progressively decrease spatial resolution and increase the number of filters.
Decoder: The decoder mirrors the encoder structure but uses upsampling layers to increase spatial resolution and decrease the number of filters. Conv2DTranspose layers are commonly used for upsampling; they perform a transposed convolution, effectively learning how to reverse the downsampling performed by the encoder. Alternatively, one can use UpSampling2D followed by a Conv2D layer (a short sketch of this variant appears below). The final layer typically uses a sigmoid activation to output pixel values in the [0, 1] range, matching our normalized input data.
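To make the alternative upsampling path concrete, here is a minimal sketch of a decoder block built from UpSampling2D plus Conv2D layers. It is not used in the model below; the names (alt_decoder, x_alt) and the bottleneck depth of 32 are assumptions chosen to mirror the architecture that follows:
# Sketch only: an alternative decoder using UpSampling2D + Conv2D (not used in the model below)
alt_decoder_inputs = keras.Input(shape=(7, 7, 32))  # 32 matches the latent_dim defined below
x_alt = layers.UpSampling2D((2, 2))(alt_decoder_inputs)                      # 7x7 -> 14x14
x_alt = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(x_alt)
x_alt = layers.UpSampling2D((2, 2))(x_alt)                                   # 14x14 -> 28x28
x_alt = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(x_alt)
alt_decoder_outputs = layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x_alt)
alt_decoder = keras.Model(alt_decoder_inputs, alt_decoder_outputs, name="alt_decoder")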
Let's define a simple CAE model:
latent_dim = 32 # Dimension of the bottleneck layer
# --- Encoder ---
encoder_inputs = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(32, (3, 3), activation='relu', padding='same', strides=2)(encoder_inputs) # 28x28 -> 14x14
x = layers.Conv2D(64, (3, 3), activation='relu', padding='same', strides=2)(x) # 14x14 -> 7x7
# Note the feature map shape at this point: (7, 7, 64); the decoder input defined below must match the bottleneck's output shape
shape_before_bottleneck = x.shape[1:]  # kept for reference when sizing the decoder
# We can use a Conv2D layer as the bottleneck directly or flatten and use Dense
# Here, using Conv2D to maintain spatial structure info as much as possible
encoder_output = layers.Conv2D(latent_dim, (3, 3), activation='relu', padding='same')(x) # Bottleneck: 7x7xlatent_dim
encoder = keras.Model(encoder_inputs, encoder_output, name="encoder")
# encoder.summary() # Optional: view encoder structure
# --- Decoder ---
decoder_inputs = keras.Input(shape=(7, 7, latent_dim)) # Input shape matches encoder output
x = layers.Conv2DTranspose(64, (3, 3), activation='relu', padding='same', strides=2)(decoder_inputs) # 7x7 -> 14x14
x = layers.Conv2DTranspose(32, (3, 3), activation='relu', padding='same', strides=2)(x) # 14x14 -> 28x28
decoder_outputs = layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x) # 28x28 -> 28x28x1
decoder = keras.Model(decoder_inputs, decoder_outputs, name="decoder")
# decoder.summary() # Optional: view decoder structure
# --- Autoencoder (Encoder + Decoder) ---
autoencoder_input = keras.Input(shape=(28, 28, 1))
encoded = encoder(autoencoder_input)
decoded = decoder(encoded)
autoencoder = keras.Model(autoencoder_input, decoded, name="autoencoder")
autoencoder.summary()
Note the use of padding='same', which keeps the spatial dimensions predictable after each convolution and makes the architecture slightly simpler to design. The strides=2 arguments in the Conv2D and Conv2DTranspose layers handle the downsampling and upsampling, respectively. The bottleneck here is a small feature map of shape 7×7×32.
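One way to confirm these shapes programmatically is to query the model objects directly; Keras reports the batch dimension as None:
# Inspect tensor shapes at the model boundaries
print(encoder.output_shape)  # (None, 7, 7, 32)  -> the bottleneck feature map
print(decoder.output_shape)  # (None, 28, 28, 1) -> the reconstructed image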
Here's a visualization of the combined autoencoder architecture:
The diagram shows the flow from input image through the downsampling convolutional layers of the encoder to the bottleneck, followed by the upsampling transpose convolutional layers of the decoder to the reconstructed output image.
We need to compile the autoencoder model, specifying an optimizer and a loss function. Since our pixel values are normalized to [0, 1], the reconstruction can be treated as a per-pixel prediction in that range, so Binary Crossentropy (binary_crossentropy) is a suitable loss function. Mean Squared Error (mse) is also a common choice for image reconstruction. We'll use the Adam optimizer.
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
# Train the autoencoder
epochs = 20 # Adjust as needed
batch_size = 128
history = autoencoder.fit(x_train, x_train,  # Input and target are the same
                          epochs=epochs,
                          batch_size=batch_size,
                          shuffle=True,
                          validation_data=(x_test, x_test))  # Use test set for validation
During training, the model learns to minimize the difference between the input images (x_train) and the images reconstructed by passing them through the encoder and then the decoder. Monitoring the validation loss (val_loss) helps check for overfitting.
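If the validation loss plateaus, one optional refinement is to stop training automatically with Keras's built-in EarlyStopping callback. The sketch below is illustrative; the patience value of 3 is just an example choice, and the call would replace the fit call shown above:
# Optional: stop training once val_loss has not improved for a few epochs
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3,
                                           restore_best_weights=True)
history = autoencoder.fit(x_train, x_train,
                          epochs=epochs,
                          batch_size=batch_size,
                          shuffle=True,
                          validation_data=(x_test, x_test),
                          callbacks=[early_stop])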
Let's visualize the training progress:
# Plot training & validation loss values
plt.figure(figsize=(10, 5))
plt.plot(history.history['loss'], label='Train Loss', color='#1c7ed6')
plt.plot(history.history['val_loss'], label='Validation Loss', color='#fd7e14')
plt.title('Model Loss During Training')
plt.ylabel('Loss (Binary Crossentropy)')
plt.xlabel('Epoch')
plt.legend(loc='upper right')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
Training and validation loss curves for the Convolutional Autoencoder. Ideally, both curves decrease and converge.
The primary way to evaluate an autoencoder for reconstruction is to visually inspect its output on unseen data (the test set).
# Use the trained autoencoder to reconstruct images from the test set
reconstructed_imgs = autoencoder.predict(x_test)
# Display original and reconstructed images
n = 10 # Number of images to display
plt.figure(figsize=(20, 4))
for i in range(n):
    # Display original
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(x_test[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    if i == 0:
        ax.set_title("Original")

    # Display reconstruction
    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(reconstructed_imgs[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    if i == 0:
        ax.set_title("Reconstructed")
plt.suptitle('Original vs. Reconstructed Images (Test Set)')
plt.show()
You should observe that the reconstructed images are somewhat blurry versions of the originals, capturing the main features but losing some fine details. This is expected, as the bottleneck layer forces the model to learn a compressed representation. The quality of reconstruction depends heavily on the model architecture, the size of the latent dimension, the dataset complexity, and the training duration.
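Visual inspection can be complemented by a simple quantitative summary. As one illustrative option (not the only possible metric), the per-image mean squared error over the test set can be computed directly with NumPy:
# Average per-pixel squared error for each test image (lower is better)
per_image_mse = np.mean((x_test - reconstructed_imgs) ** 2, axis=(1, 2, 3))
print(f"Mean reconstruction MSE over the test set: {per_image_mse.mean():.5f}")
print(f"Worst reconstructed image MSE: {per_image_mse.max():.5f}")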
This example provides a starting point for implementing CAEs. You can experiment further by:
- Using UpSampling2D followed by Conv2D instead of Conv2DTranspose in the decoder, as sketched earlier in this section.
- Changing latent_dim to see how it affects reconstruction quality and the level of compression.
- Adding noise to the training inputs (for example, an x_train_noisy array) while keeping the clean images (x_train) as the target, turning the model into a denoising autoencoder. This often leads to more robust feature extraction; a sketch of preparing such noisy inputs appears at the end of this section.
- Using the trained encoder part of the model to extract compressed features from images for downstream tasks like classification.

By building and experimenting with CAEs, you gain practical experience in designing architectures tailored for spatial data and understanding the trade-offs involved in representation learning through compression.
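As a final illustration of the denoising variant listed above, here is a minimal sketch of how noisy training inputs could be prepared. The noise_factor value and the x_train_noisy / x_test_noisy names are assumptions for illustration, not part of the model defined earlier:
# Sketch: create noisy copies of the data for a denoising autoencoder
noise_factor = 0.3  # illustrative noise strength, worth tuning
x_train_noisy = np.clip(x_train + noise_factor * np.random.normal(size=x_train.shape), 0.0, 1.0)
x_test_noisy = np.clip(x_test + noise_factor * np.random.normal(size=x_test.shape), 0.0, 1.0)
# Train on noisy inputs while reconstructing the clean targets
# autoencoder.fit(x_train_noisy, x_train, epochs=epochs, batch_size=batch_size,
#                 shuffle=True, validation_data=(x_test_noisy, x_test))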