As discussed earlier in this chapter, sparse autoencoders aim to learn compressed representations by encouraging sparsity in the activations of the hidden (bottleneck) layer. This forces the network to use only a small subset of hidden units for any given input, potentially leading to more specialized feature detectors. Let's put this theory into practice by implementing sparse autoencoders using two common techniques: L1 regularization and KL divergence penalty.
We will use TensorFlow with the Keras API for these examples. Ensure you have TensorFlow installed (pip install tensorflow). We'll work with the Fashion-MNIST dataset, a slightly more challenging alternative to MNIST.
First, let's import the necessary libraries and load the dataset.
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers, losses, backend as K
import numpy as np
import matplotlib.pyplot as plt
# Load Fashion-MNIST dataset
(x_train, _), (x_test, _) = tf.keras.datasets.fashion_mnist.load_data()
# Normalize and reshape data (flatten images)
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))
print(f"Training data shape: {x_train.shape}")
print(f"Test data shape: {x_test.shape}")
# Define input shape and encoding dimension
input_dim = x_train.shape[1] # 784 for Fashion-MNIST
encoding_dim = 64 # Size of the bottleneck layer
The most straightforward way to encourage sparsity is to add a penalty to the loss function that is proportional to the L1 norm (sum of absolute values) of the bottleneck layer's activations. Keras provides a convenient way to do this via the activity_regularizer argument on the bottleneck layer.
The total loss becomes:
$$\text{Loss} = \text{Reconstruction Loss} + \lambda \sum_i |h_i|$$

where $h_i$ are the activations of the bottleneck layer units and $\lambda$ is the regularization strength parameter.
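To make the penalty concrete, here is a small sketch of how the same term could be computed by hand for one batch of bottleneck activations (the random batch and the l1_strength name are purely illustrative; Keras's l1 activity regularizer adds an equivalent term to the loss automatically):
# Hand-computed L1 activity penalty for one hypothetical batch (illustration only)
l1_strength = 1e-5                                # plays the role of lambda
h = np.random.rand(256, 64).astype('float32')     # pretend batch of bottleneck activations
l1_penalty = l1_strength * np.sum(np.abs(h))      # lambda * sum of absolute activations
print(f"L1 activity penalty for this batch: {l1_penalty:.4f}")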
Let's define the model architecture.
# L1 Regularization strength
l1_lambda = 1e-5 # This is a hyperparameter to tune
# Define the input layer
input_img = layers.Input(shape=(input_dim,))
# Define the encoder with L1 activity regularization on the bottleneck
encoded = layers.Dense(128, activation='relu')(input_img)
encoded = layers.Dense(encoding_dim, activation='relu',
                       activity_regularizer=regularizers.l1(l1_lambda))(encoded)  # Apply L1 here
# Define the decoder
decoded = layers.Dense(128, activation='relu')(encoded)
decoded = layers.Dense(input_dim, activation='sigmoid')(decoded) # Sigmoid for pixel values [0, 1]
# Build the autoencoder model
autoencoder_l1 = models.Model(input_img, decoded)
# Compile the model
autoencoder_l1.compile(optimizer='adam', loss='binary_crossentropy') # BCE suitable for [0,1] pixel values
autoencoder_l1.summary()
Now, train the model. We don't need the labels (y_train, y_test), as autoencoders are unsupervised.
# Training parameters
epochs = 30
batch_size = 256
# Train the autoencoder
history_l1 = autoencoder_l1.fit(x_train, x_train,  # Input and target are the same
                                epochs=epochs,
                                batch_size=batch_size,
                                shuffle=True,
                                validation_data=(x_test, x_test),
                                verbose=1)  # Set verbose=2 for less output per epoch
print("Training complete.")
After training, you can inspect the reconstructions and, more importantly for sparsity, examine the activations in the bottleneck layer.
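For the reconstructions, a minimal sketch could look like the following (the decoded_imgs_l1 name is just illustrative; images are reshaped back to 28x28 for display):
# Visualize a few test images next to their reconstructions
decoded_imgs_l1 = autoencoder_l1.predict(x_test)
n = 5  # number of examples to display
plt.figure(figsize=(10, 4))
for i in range(n):
    # Original image on the top row
    plt.subplot(2, n, i + 1)
    plt.imshow(x_test[i].reshape(28, 28), cmap='gray')
    plt.axis('off')
    # Reconstruction on the bottom row
    plt.subplot(2, n, i + 1 + n)
    plt.imshow(decoded_imgs_l1[i].reshape(28, 28), cmap='gray')
    plt.axis('off')
plt.suptitle('Originals (top) and L1 autoencoder reconstructions (bottom)')
plt.show()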
# Build the encoder model separately to get bottleneck activations
encoder_l1 = models.Model(input_img, encoded)
# Get bottleneck activations for test data
encoded_imgs_l1 = encoder_l1.predict(x_test)
# Calculate and print average activation value
print(f"Average activation in L1 bottleneck: {np.mean(encoded_imgs_l1):.4f}")
# Visualize average activation per neuron
avg_activations_l1 = np.mean(encoded_imgs_l1, axis=0)
plt.figure(figsize=(10, 4))
plt.bar(range(encoding_dim), avg_activations_l1)
plt.title('Average Activation per Neuron (L1 Regularization)')
plt.xlabel('Neuron Index')
plt.ylabel('Average Activation')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
You should observe that many neurons have very low average activations, indicating that the L1 penalty successfully induced sparsity. The value of l1_lambda influences the degree of sparsity; higher values lead to sparser representations but may hurt reconstruction quality if set too high.
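A simple way to quantify this sparsity (the 0.01 threshold is an arbitrary choice for "effectively zero") is to measure the fraction of bottleneck activations that fall below a small cutoff:
# Fraction of bottleneck activations that are effectively zero (illustrative metric)
threshold = 0.01
sparsity_fraction_l1 = np.mean(encoded_imgs_l1 < threshold)
print(f"Fraction of near-zero activations (L1): {sparsity_fraction_l1:.2%}")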
Another approach is to enforce sparsity by adding a KL divergence term to the loss function. This term measures the difference between the desired average activation of the hidden units (a small value $\rho$, e.g., 0.05) and the actual average activation observed over the training batch ($\hat{\rho}_j$ for neuron $j$).
The KL divergence penalty for a single neuron $j$ is:

$$\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log\frac{\rho}{\hat{\rho}_j} + (1 - \rho)\log\frac{1 - \rho}{1 - \hat{\rho}_j}$$

The total sparsity penalty added to the loss is the sum over all bottleneck neurons, weighted by a parameter $\beta$:

$$\text{Loss} = \text{Reconstruction Loss} + \beta \sum_{j=1}^{\text{encoding\_dim}} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$$

Implementing this typically requires a custom layer or modifying the training loop to calculate $\hat{\rho}_j$ and add the KL term. Here's how you can define a custom regularizer in Keras.
# Sparsity parameters
rho = 0.05 # Target sparsity
beta = 3 # Sparsity weight
# Custom KL divergence regularizer
class KLDivergenceRegularizer(regularizers.Regularizer):
    def __init__(self, rho, beta):
        self.rho = rho
        self.beta = beta

    def __call__(self, activations):
        # Average activation of each neuron over the batch (axis 0 is the batch dimension)
        rho_hat = K.mean(activations, axis=0)
        # Clip to avoid division by zero and log(0) when rho_hat is exactly 0 or 1
        rho_hat = K.clip(rho_hat, K.epsilon(), 1 - K.epsilon())
        # Element-wise KL divergence between the target rho and the observed rho_hat
        kl_divergence = self.rho * K.log(self.rho / rho_hat) + \
                        (1 - self.rho) * K.log((1 - self.rho) / (1 - rho_hat))
        # Return the scaled sum over all bottleneck neurons
        return self.beta * K.sum(kl_divergence)

    def get_config(self):
        return {'rho': float(self.rho), 'beta': float(self.beta)}
# Define the model architecture using the KL regularizer
input_img_kl = layers.Input(shape=(input_dim,))
encoded_kl = layers.Dense(128, activation='relu')(input_img_kl)
# Apply KL regularizer to the bottleneck activations
encoded_kl = layers.Dense(encoding_dim, activation='sigmoid',  # Sigmoid often used here to keep activations in the [0, 1] range
                          activity_regularizer=KLDivergenceRegularizer(rho, beta))(encoded_kl)
decoded_kl = layers.Dense(128, activation='relu')(encoded_kl)
decoded_kl = layers.Dense(input_dim, activation='sigmoid')(decoded_kl)
autoencoder_kl = models.Model(input_img_kl, decoded_kl)
# Compile the model (ensure loss is appropriate, e.g., BCE)
autoencoder_kl.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder_kl.summary()
# Train the KL-regularized autoencoder
print("\nTraining KL Divergence Sparse Autoencoder...")
history_kl = autoencoder_kl.fit(x_train, x_train,
                                epochs=epochs,
                                batch_size=batch_size,
                                shuffle=True,
                                validation_data=(x_test, x_test),
                                verbose=1)
print("Training complete.")
Note the use of activation='sigmoid' in the KL-regularized bottleneck layer. This is common because the KL divergence formula assumes the average activations $\hat{\rho}_j$ lie between 0 and 1, which sigmoid ensures. If using ReLU, activations could exceed 1, potentially causing issues with the log terms in the KL formula.
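To build intuition for how strongly this penalty reacts, here is a small numeric sketch (plain NumPy, with the candidate values chosen arbitrarily) evaluating the per-neuron KL term as the observed average activation drifts away from the target:
# Per-neuron KL penalty for a few hypothetical average activations (illustration only)
def kl_term(rho_target, rho_hat):
    return rho_target * np.log(rho_target / rho_hat) + \
           (1 - rho_target) * np.log((1 - rho_target) / (1 - rho_hat))

for rho_hat in [0.05, 0.10, 0.30, 0.50]:
    print(f"rho_hat = {rho_hat:.2f} -> KL = {kl_term(rho, rho_hat):.4f}")
The penalty is zero when the observed average matches the target and grows quickly as it drifts upward, which is what pushes the average activations toward rho during training.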
Now, let's evaluate the sparsity achieved with KL divergence.
# Build the corresponding encoder
encoder_kl = models.Model(input_img_kl, encoded_kl)
# Get bottleneck activations
encoded_imgs_kl = encoder_kl.predict(x_test)
# Calculate and print average activation value
print(f"Average activation in KL bottleneck: {np.mean(encoded_imgs_kl):.4f}")
# Visualize average activation per neuron
avg_activations_kl = np.mean(encoded_imgs_kl, axis=0)
plt.figure(figsize=(10, 4))
plt.bar(range(encoding_dim), avg_activations_kl)
plt.axhline(rho, color='r', linestyle='--', label=f'Target Sparsity rho={rho}')
plt.title('Average Activation per Neuron (KL Divergence)')
plt.xlabel('Neuron Index')
plt.ylabel('Average Activation')
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
# Visualize a histogram of all bottleneck activations from the test set
plt.figure(figsize=(8, 5))
plt.hist(encoded_imgs_kl.flatten(), bins=50, color='#4dabf7', alpha=0.8)
plt.title('Histogram of KL Bottleneck Activations (Test Set)')
plt.xlabel('Activation Value')
plt.ylabel('Frequency')
plt.yscale('log') # Use log scale to see low activation frequencies better
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
Comparison of average activations per neuron for L1 and KL divergence sparse autoencoders on the Fashion-MNIST test set. KL divergence aims for a specific target average activation (e.g., 0.05), while L1 encourages activations towards zero without a fixed target.
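If you want to reproduce a comparison like the one described above, the per-neuron averages computed earlier can be plotted side by side; a minimal sketch:
# Side-by-side comparison of per-neuron average activations for both models
idx = np.arange(encoding_dim)
width = 0.4
plt.figure(figsize=(12, 4))
plt.bar(idx - width / 2, avg_activations_l1, width, label='L1 regularization')
plt.bar(idx + width / 2, avg_activations_kl, width, label='KL divergence')
plt.axhline(rho, color='r', linestyle='--', label=f'Target sparsity rho={rho}')
plt.title('Average Activation per Neuron: L1 vs. KL Divergence')
plt.xlabel('Neuron Index')
plt.ylabel('Average Activation')
plt.legend()
plt.show()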
The KL divergence approach attempts to force the average activation of each hidden unit across the batch towards the target $\rho$. The histogram often shows a peak near zero and possibly another small peak near one (if using sigmoid activation), with most activations being very small. The beta parameter controls the strength of this sparsity constraint relative to the reconstruction loss.
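When tuning beta, one quick check (an informal heuristic, not a standard metric) is how far the observed per-neuron averages sit from the target, alongside the same near-zero fraction used for the L1 model:
# How closely the observed average activations match the target rho (illustrative check)
mean_deviation = np.mean(np.abs(avg_activations_kl - rho))
print(f"Mean |rho_hat - rho| across neurons: {mean_deviation:.4f}")
print(f"Fraction of near-zero activations (KL): {np.mean(encoded_imgs_kl < 0.01):.2%}")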
This practical exercise demonstrated how to implement sparse autoencoders using L1 and KL divergence regularization in TensorFlow/Keras. Both methods effectively encourage sparsity in the bottleneck layer, forcing the network to learn more compressed and potentially more meaningful features compared to a standard autoencoder. The choice between L1 and KL divergence, along with the tuning of their respective hyperparameters ($\lambda$ for L1, or $\rho$ and $\beta$ for the KL penalty), depends on the specific dataset and task requirements. Experimenting with these parameters is necessary to find a balance between good reconstruction quality and the desired level of sparsity. These regularized models often provide representations that are more robust and better suited for downstream tasks.