Practice: Autoencoder Features for Classification

Throughout this chapter, we've discussed the methods for selecting, tuning, and evaluating autoencoders for feature extraction. Now, let's bring these ideas together and apply them to a concrete classification problem. The objective here is not just to build an autoencoder, but to see how the features it learns can impact the performance of a downstream supervised learning model. We'll walk through the process of training a baseline classifier, then an autoencoder to extract features, and finally, a classifier using these new features, comparing the results along the way.

Setting the Stage: The Dataset and Baseline Model

For this exercise, we'll use a common dataset that's suitable for classification and where feature extraction might offer some benefits. Let's consider the "Digits" dataset, available in scikit-learn, which consists of 8x8 pixel images of handwritten digits (0-9). Each image is represented as a 64-dimensional vector. Our goal is to classify these digits.

First, we need to establish a baseline. This involves training a standard classification model on the original, raw features of the dataset. This baseline will serve as a point of comparison to evaluate whether using autoencoder-extracted features provides any advantage.

Load and Prepare Data: We'll load the Digits dataset and split it into training and testing sets. It's also good practice to scale the features, typically to a [0, 1] range, which helps in training neural networks (including autoencoders) and many classifiers.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

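# Load data
digits = load_digits()
X, y = digits.data, digits.target

# Split first so the test set stays unseen during preprocessing
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale features to [0, 1], fitting the scaler on the training split only
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train_raw)
X_test = scaler.transform(X_test_raw)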

Train Baseline Classifier: We'll use a simple Logistic Regression model as our baseline classifier.

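# Train baseline Logistic Regression model (one-vs-rest).
# The multi_class argument is deprecated in recent scikit-learn releases,
# so the one-vs-rest scheme is made explicit with OneVsRestClassifier.
from sklearn.multiclass import OneVsRestClassifier

baseline_model = OneVsRestClassifier(
    LogisticRegression(solver='liblinear', random_state=42, max_iter=1000)
)
baseline_model.fit(X_train, y_train)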

# Evaluate baseline model
y_pred_baseline = baseline_model.predict(X_test)
baseline_accuracy = accuracy_score(y_test, y_pred_baseline)
print(f"Baseline Logistic Regression Accuracy: {baseline_accuracy:.4f}")

Let's assume this gives us an accuracy of, say, 0.9556. This is the score we'll try to match or improve upon using autoencoder features, potentially with a more compact feature set.
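A single train/test split gives only one estimate of the baseline. As a rough sketch of how to gauge its variability, the snippet below runs 5-fold stratified cross-validation with the scaler placed inside a pipeline. The pipeline, the fold count, and the default lbfgs solver are illustrative assumptions, not part of the walkthrough above.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)

# Scaling lives inside the pipeline, so each fold fits the scaler
# on its own training portion only.
pipeline = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# Assumption: 5 shuffled, stratified folds; any reasonable fold count works.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")

print(f"CV accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")
```

The spread across folds indicates how large a difference between two models needs to be before it is worth taking seriously.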

Building an Autoencoder for Feature Extraction

Now, we'll design and train an autoencoder using PyTorch. The encoder part of this autoencoder will learn to transform the 64-dimensional input into a lower-dimensional representation.

Autoencoder Architecture: We'll construct a simple, fully-connected autoencoder. The dimensionality of the latent space is a key hyperparameter. Let's try reducing the 64 dimensions to, for example, 32.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

# Convert numpy arrays to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)

# Create TensorDatasets and DataLoaders
train_dataset = TensorDataset(X_train_tensor, X_train_tensor) # Autoencoder takes input as target
test_dataset = TensorDataset(X_test_tensor, X_test_tensor)

batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

input_dim = X_train.shape[1] # Should be 64
latent_dim = 32 # Our chosen latent space dimensionality

# Define the Autoencoder model
class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, latent_dim), # Bottleneck layer
            nn.ReLU() # ReLU here for non-negative features, common in some AE uses
        )
        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, input_dim),
            nn.Sigmoid() # Sigmoid for [0,1] scaled data
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

autoencoder = Autoencoder(input_dim, latent_dim)
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
autoencoder.to(device)

# Define loss function and optimizer
criterion = nn.MSELoss() # MSE for reconstruction task
optimizer = optim.Adam(autoencoder.parameters(), lr=1e-3)

print(autoencoder)

Here, nn.Sigmoid() is used in the final decoder layer because the input features are scaled to the [0, 1] range, so the reconstruction is constrained to the same range. nn.MSELoss() (mean squared error) is a common loss function for reconstructing continuous-valued inputs. The encoder is defined as its own nn.Sequential module inside the Autoencoder class so that it can be called on its own later to extract features. Note that the bottleneck ends in a ReLU, so the extracted features are non-negative.
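To get a feel for the size of this architecture, the layer sizes above can be turned into a parameter count by hand. The sketch below uses plain Python arithmetic; the helper name linear_params is hypothetical, and it relies only on the fact that a fully connected layer has in_features * out_features weights plus out_features biases.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

input_dim, latent_dim = 64, 32

# Layer shapes copied from the encoder and decoder definitions above.
encoder_layers = [(input_dim, 128), (128, 64), (64, latent_dim)]
decoder_layers = [(latent_dim, 64), (64, 128), (128, input_dim)]

encoder_params = sum(linear_params(i, o) for i, o in encoder_layers)
decoder_params = sum(linear_params(i, o) for i, o in decoder_layers)
total_params = encoder_params + decoder_params

print(f"Encoder: {encoder_params}, decoder: {decoder_params}, total: {total_params}")
# prints "Encoder: 18656, decoder: 18688, total: 37344"
```

With roughly 1,257 training images (70% of the 1,797 samples in Digits), the network has far more parameters than training examples, which is one reason to keep an eye on the validation loss during training.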

Train the Autoencoder: We train the autoencoder to reconstruct the input data.

epochs = 50
history = {'loss': [], 'val_loss': []}

for epoch in range(epochs):
    # Training
    autoencoder.train()
    train_loss = 0
    for batch_X, _ in train_loader: # _ is the target, which is identical to input for AE
        batch_X = batch_X.to(device)
        optimizer.zero_grad()
        reconstruction = autoencoder(batch_X)
        loss = criterion(reconstruction, batch_X)
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * batch_X.size(0) # Accumulate sum of losses

    avg_train_loss = train_loss / len(train_loader.dataset)
    history['loss'].append(avg_train_loss)

    # Validation
    autoencoder.eval()
    val_loss = 0
    with torch.no_grad():
        for batch_X_test, _ in test_loader:
            batch_X_test = batch_X_test.to(device)
            reconstruction = autoencoder(batch_X_test)
            loss = criterion(reconstruction, batch_X_test)
            val_loss += loss.item() * batch_X_test.size(0)

    avg_val_loss = val_loss / len(test_loader.dataset)
    history['val_loss'].append(avg_val_loss)

    print(f"Epoch {epoch+1}/{epochs}, Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

print("Autoencoder training complete.")

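Monitoring the validation loss helps detect overfitting: if val_loss starts increasing while loss keeps decreasing, the autoencoder is fitting the training data more closely without generalizing any better. For simplicity, this example uses the test set for validation. If you use the validation loss to make decisions (when to stop training, which latent_dim to keep), hold out a separate validation split from the training data so that the final test accuracy remains an unbiased estimate.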
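One common way to act on this signal is early stopping. The following is a minimal sketch, assuming a hypothetical EarlyStopping helper with a patience parameter; the loss values are made up purely to exercise the logic and are not results from this autoencoder.

```python
class EarlyStopping:
    """Hypothetical helper: signals a stop after `patience` epochs without improvement."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, val_loss, epoch):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses that improve and then drift upward.
simulated_val_losses = [0.080, 0.050, 0.040, 0.041, 0.042, 0.043, 0.044]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(simulated_val_losses):
    if stopper.step(val_loss, epoch):
        stopped_at = epoch
        break

print(f"Best epoch: {stopper.best_epoch}, stopped at epoch: {stopped_at}")
# prints "Best epoch: 2, stopped at epoch: 5"
```

In the training loop above, the equivalent would be to call stopper.step(avg_val_loss, epoch) after the validation pass and break when it returns True, ideally keeping a copy of autoencoder.state_dict() from the best epoch.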

Extracting Features and Training the Classifier

With the autoencoder trained, we can now use its encoder part to transform our original training and testing datasets into their latent representations.

Extract Features: Use the encoder module from our trained autoencoder to get the compressed features.

autoencoder.eval() # Set autoencoder to evaluation mode
with torch.no_grad():
    X_train_encoded = autoencoder.encoder(X_train_tensor.to(device)).cpu().numpy()
    X_test_encoded = autoencoder.encoder(X_test_tensor.to(device)).cpu().numpy()

print(f"Original feature shape: {X_train.shape}")
print(f"Encoded feature shape: {X_train_encoded.shape}")

This should show that the number of features has been reduced from 64 to latent_dim (32 in our example).
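Because the bottleneck ends in a ReLU, a latent unit can end up outputting zero for every sample, in which case it carries no information for the classifier. The sketch below shows one way to check for such units. The function name summarize_latent_features and the stand-in array are assumptions for illustration; in practice you would pass X_train_encoded.

```python
import numpy as np

def summarize_latent_features(encoded, tol=1e-8):
    """Return indices of latent units that are (near) zero for every sample."""
    max_activation = np.abs(encoded).max(axis=0)
    dead_units = np.flatnonzero(max_activation <= tol)
    return dead_units, max_activation

# Stand-in data (not real encoder output): 5 samples, 4 latent units,
# with unit 2 zeroed out to mimic a unit that ReLU never activates.
rng = np.random.default_rng(0)
fake_encoded = rng.uniform(0.1, 1.0, size=(5, 4))
fake_encoded[:, 2] = 0.0

dead_units, max_activation = summarize_latent_features(fake_encoded)
print(f"Dead latent units: {dead_units.tolist()}")
# prints "Dead latent units: [2]"
```

If many units turn out to be dead, the effective latent dimensionality is smaller than latent_dim suggests.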

Train Classifier on Extracted Features: Now, we train the same Logistic Regression classifier, but this time using X_train_encoded and X_test_encoded.

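# Train Logistic Regression model on encoded features (one-vs-rest).
# The multi_class argument is deprecated in recent scikit-learn releases,
# so the one-vs-rest scheme is made explicit with OneVsRestClassifier.
from sklearn.multiclass import OneVsRestClassifier

ae_feature_model = OneVsRestClassifier(
    LogisticRegression(solver='liblinear', random_state=42, max_iter=1000)
)
ae_feature_model.fit(X_train_encoded, y_train)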

# Evaluate model on encoded features
y_pred_ae_features = ae_feature_model.predict(X_test_encoded)
ae_features_accuracy = accuracy_score(y_test, y_pred_ae_features)
print(f"Logistic Regression with Autoencoder Features Accuracy: {ae_features_accuracy:.4f}")

Comparing Performance and Discussion

Let's say our classifier using autoencoder features achieves an accuracy of 0.9611.

Here's a simple comparison:

Baseline Accuracy (64 features): 0.9556
AE Features Accuracy (32 features): 0.9611

Figure: Classifier accuracy using original features versus features extracted by an autoencoder (hypothetical values).

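In this hypothetical scenario, we achieved a slight improvement in accuracy while halving the number of features. This is a positive outcome, although a gap this small can fall within the variation caused by the particular train/test split and random initialization, so it is suggestive rather than conclusive. The autoencoder might have learned a more discriminative or noise-reduced representation of the data, which benefits the classifier.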
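To put the two hypothetical accuracies in perspective, it helps to convert them back into counts of test images. The sketch below is a back-of-the-envelope calculation: it assumes the 1,797-sample Digits dataset with test_size=0.3 and uses the binomial standard error as a rough gauge. The helper name accuracy_std_error is hypothetical, and a paired test on the two models' predictions would be more rigorous.

```python
import math

n_total = 1797                      # samples in scikit-learn's Digits dataset
n_test = math.ceil(0.3 * n_total)   # train_test_split rounds the test size up
acc_baseline, acc_ae = 0.9556, 0.9611

correct_baseline = round(acc_baseline * n_test)
correct_ae = round(acc_ae * n_test)
extra_correct = correct_ae - correct_baseline

def accuracy_std_error(acc, n):
    """Binomial standard error of an accuracy measured on n samples."""
    return math.sqrt(acc * (1.0 - acc) / n)

se_baseline = accuracy_std_error(acc_baseline, n_test)
gap = acc_ae - acc_baseline

print(f"Test samples: {n_test}")
print(f"Extra correct predictions: {extra_correct}")
print(f"Std. error of baseline accuracy: {se_baseline:.4f}")
print(f"Accuracy gap: {gap:.4f}")
```

Under these assumptions the gap corresponds to three additional test images classified correctly out of 540, which is smaller than one standard error of the baseline accuracy.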

Potential Outcomes and Considerations:

Improved Performance: As seen above, the latent features might be more salient for the classification task. The autoencoder can act as a non-linear feature learner, capturing complex relationships that a linear model like Logistic Regression might miss on its own with raw features.
Comparable Performance with Reduced Dimensionality: Even if the accuracy is similar (e.g., 0.9550 with AE features), achieving this with fewer features (32 vs. 64) is often valuable. It can lead to faster training times for the downstream classifier, reduced storage, and potentially better generalization if the original features had high redundancy.
Worse Performance: If the accuracy drops significantly (e.g., to 0.9200), it could indicate:
- Poor Autoencoder Training: The autoencoder might not have been trained sufficiently, or its architecture (including latent dimension) might not be optimal. The reconstruction loss should be reasonably low.
- Information Loss: The chosen latent_dim might be too small, causing critical information for classification to be lost during compression.
- Suboptimal Autoencoder Design: Perhaps a different type of autoencoder (e.g., Denoising Autoencoder if the data is noisy, or a deeper architecture) would yield better features.
- Untuned Hyperparameters: The learning rate, number of epochs, and batch size for the autoencoder, as well as the architecture itself (number of layers, neurons per layer), are all hyperparameters that might need tuning, as discussed earlier in this chapter.
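If you want to try the denoising variant mentioned above, the main change is in the data: the autoencoder receives a corrupted input and is trained to reconstruct the clean original. The following sketch shows one possible corruption step in NumPy. The function name add_gaussian_noise, the noise level of 0.1, and the stand-in batch are all assumptions for illustration.

```python
import numpy as np

def add_gaussian_noise(X, noise_std=0.1, seed=0):
    """Return a noisy copy of X, clipped back into the [0, 1] range."""
    rng = np.random.default_rng(seed)
    noisy = X + rng.normal(loc=0.0, scale=noise_std, size=X.shape)
    return np.clip(noisy, 0.0, 1.0)

# Stand-in batch: 4 "images" of 64 values already scaled to [0, 1].
X_clean = np.random.default_rng(1).uniform(0.0, 1.0, size=(4, 64))
X_noisy = add_gaussian_noise(X_clean, noise_std=0.1)

print(X_noisy.shape, float(X_noisy.min()), float(X_noisy.max()))
```

To use this with the training loop above, you would build the dataset from (noisy, clean) pairs and compute the loss against the clean batch rather than against the network's input.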

Further Steps You Can Take:

Vary Latent Dimensionality: Experiment with different sizes for latent_dim. A very small latent dimension might lead to information loss, while a very large one might not offer much compression or feature learning benefit. Plotting classifier performance against latent dimension size can be very insightful.
Different Classifiers: Try using the extracted features with other types of classifiers (e.g., Support Vector Machines, Random Forests, or even a small Multi-Layer Perceptron) to see if the benefits translate across different models.
Evaluate Feature Quality: Beyond downstream task performance, you could also try to visualize the latent space (if latent_dim is 2 or 3) to see if digits of the same class cluster together.
Advanced Autoencoders: For more complex datasets, especially images, you would typically use Convolutional Autoencoders. For noisy data, Denoising Autoencoders could be beneficial. If you suspect the latent space needs more structure, Variational Autoencoders (VAEs) are worth exploring; their features are extracted in much the same way, usually by taking the encoder's mean vector $\mu$ as the representation.
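The latent-dimension experiment suggested above can be organized as a simple loop over candidate sizes. The sketch below shows the shape of such a sweep. To keep it short and fast, it uses PCA as a stand-in for a trained autoencoder, which is a deliberate substitution: PCA is linear, so its numbers are a reference point rather than a prediction of what the autoencoder would achieve. The candidate dimensions and the default lbfgs solver are also assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

sweep_results = {}
for dim in (2, 8, 16, 32):
    # Stand-in encoder: PCA instead of a trained autoencoder.
    reducer = PCA(n_components=dim, random_state=42).fit(X_train)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(reducer.transform(X_train), y_train)
    sweep_results[dim] = clf.score(reducer.transform(X_test), y_test)

for dim, acc in sweep_results.items():
    print(f"dim={dim:2d}  accuracy={acc:.4f}")
```

To run the sweep with the autoencoder itself, replace the PCA step with training a fresh Autoencoder(input_dim, dim) and encoding both splits with its encoder. Comparing the two curves also shows whether the non-linear encoder adds anything over a linear projection of the same size.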

This practice session demonstrates the end-to-end workflow of employing autoencoders for feature extraction in a supervised learning context. The key is to remember that the autoencoder is a tool; its effectiveness depends on careful design, training, and evaluation in the context of your specific problem and data. The features it extracts are not guaranteed to be better, but they offer a powerful way to transform your data, often leading to more compact and informative representations.
