Applying autoencoder features to a concrete classification problem shows how features learned by an autoencoder can affect the performance of a downstream supervised learning model. The process has three steps: train a baseline classifier, train an autoencoder to extract features, and train a classifier on those new features, then compare the results.

Setting the Stage: The Dataset and Baseline Model

For this exercise, we'll use a common dataset that is suitable for classification and where feature extraction might offer some benefit: the "Digits" dataset, available in scikit-learn, which consists of 8x8 pixel images of handwritten digits (0-9). Each image is represented as a 64-dimensional vector, and our goal is to classify the digits.

First, we need to establish a baseline by training a standard classification model on the original, raw features of the dataset. This baseline serves as the point of comparison for evaluating whether autoencoder-extracted features provide any advantage.

Load and Prepare Data: We'll load the Digits dataset and split it into training and testing sets.
It's also good practice to scale the features, typically to a [0, 1] range, which helps when training neural networks (including autoencoders) and many classifiers.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data
digits = load_digits()
X, y = digits.data, digits.target

# Scale features to [0, 1]
# (for simplicity the scaler is fit on the full dataset; fitting it on the
# training split only would avoid using test-set statistics)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Split data, keeping class proportions equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y
)
```

Train Baseline Classifier: We'll use a simple Logistic Regression model as our baseline classifier.

```python
# Train baseline Logistic Regression model
# (newer scikit-learn releases may warn that `multi_class` is deprecated;
# wrapping the estimator in OneVsRestClassifier is the suggested replacement)
baseline_model = LogisticRegression(
    solver='liblinear', multi_class='ovr', random_state=42, max_iter=1000
)
baseline_model.fit(X_train, y_train)

# Evaluate baseline model
y_pred_baseline = baseline_model.predict(X_test)
baseline_accuracy = accuracy_score(y_test, y_pred_baseline)
print(f"Baseline Logistic Regression Accuracy: {baseline_accuracy:.4f}")
```

Let's assume this gives us an accuracy of, say, 0.9556. This is the score we'll try to match or improve upon using autoencoder features, potentially with a more compact feature set.

Building an Autoencoder for Feature Extraction

Now, we'll design and train an autoencoder using PyTorch. The encoder part of this autoencoder will learn to transform the 64-dimensional input into a lower-dimensional representation.

Autoencoder Architecture: We'll construct a simple, fully-connected autoencoder. The dimensionality of the latent space is an important hyperparameter.
Let's try reducing the 64 dimensions to, for example, 32.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

# Convert numpy arrays to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)

# Create TensorDatasets and DataLoaders
train_dataset = TensorDataset(X_train_tensor, X_train_tensor)  # Autoencoder takes input as target
test_dataset = TensorDataset(X_test_tensor, X_test_tensor)

batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

input_dim = X_train.shape[1]  # Should be 64
latent_dim = 32               # Our chosen latent space dimensionality


# Define the Autoencoder model
class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, latent_dim),  # Bottleneck layer
            nn.ReLU()  # ReLU here for non-negative features, common in some AE uses
        )
        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, input_dim),
            nn.Sigmoid()  # Sigmoid for [0,1] scaled data
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded


autoencoder = Autoencoder(input_dim, latent_dim)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
autoencoder.to(device)

# Define loss function and optimizer
criterion = nn.MSELoss()  # MSE for reconstruction task
optimizer = optim.Adam(autoencoder.parameters(), lr=1e-3)

print(autoencoder)
```

Here, `nn.Sigmoid()` is used in the final decoder layer because our input data `X_scaled` is normalized to the [0, 1] range.
`nn.MSELoss()` (mean squared error) is a common loss function for reconstruction tasks with continuous-valued input. We also define the encoder part as a separate sequential module within the `Autoencoder` class for easy extraction later.

Train the Autoencoder: We train the autoencoder to reconstruct the input data.

```python
epochs = 50
history = {'loss': [], 'val_loss': []}

for epoch in range(epochs):
    # Training
    autoencoder.train()
    train_loss = 0
    for batch_X, _ in train_loader:  # _ is the target, which is identical to input for AE
        batch_X = batch_X.to(device)

        optimizer.zero_grad()
        reconstruction = autoencoder(batch_X)
        loss = criterion(reconstruction, batch_X)
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * batch_X.size(0)  # Accumulate sum of losses

    avg_train_loss = train_loss / len(train_loader.dataset)
    history['loss'].append(avg_train_loss)

    # Validation
    autoencoder.eval()
    val_loss = 0
    with torch.no_grad():
        for batch_X_test, _ in test_loader:
            batch_X_test = batch_X_test.to(device)
            reconstruction = autoencoder(batch_X_test)
            loss = criterion(reconstruction, batch_X_test)
            val_loss += loss.item() * batch_X_test.size(0)

    avg_val_loss = val_loss / len(test_loader.dataset)
    history['val_loss'].append(avg_val_loss)

    print(f"Epoch {epoch+1}/{epochs}, Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

print("Autoencoder training complete.")
```

Monitoring the validation loss is important to prevent overfitting.
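
A simple patience rule is one way to turn that monitoring into a stopping decision. The sketch below is only an illustration, not part of the original workflow: `early_stopping_epoch`, `patience`, and `min_delta` are hypothetical names, and the function assumes a list of per-epoch validation losses such as `history['val_loss']`.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (best_epoch, stop_epoch) as 0-based indices.

    best_epoch is the epoch with the lowest validation loss seen so far;
    stop_epoch is where training would halt after `patience` consecutive
    epochs without an improvement of more than `min_delta`.
    """
    if not val_losses:
        raise ValueError("val_losses must contain at least one value")

    best_idx = 0
    best_loss = float("inf")
    for idx, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best validation loss: remember where it occurred
            best_idx, best_loss = idx, loss
        elif idx - best_idx >= patience:
            # No improvement for `patience` epochs: stop here
            return best_idx, idx
    # The patience limit was never reached
    return best_idx, len(val_losses) - 1
```

For example, `early_stopping_epoch(history['val_loss'], patience=5)` would report which epoch had the lowest validation loss and where a patience-based rule would have stopped. Note that the loop above computes `val_loss` on the test split, so a stricter setup would likely hold out a separate validation split for this decision.
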
If `val_loss` starts increasing while `loss` keeps decreasing, it's a sign of overfitting.

Extracting Features and Training the Classifier

With the autoencoder trained, we can now use its encoder part to transform our original training and testing datasets into their latent representations.

Extract Features: Use the encoder module from our trained autoencoder to get the compressed features.

```python
autoencoder.eval()  # Set autoencoder to evaluation mode
with torch.no_grad():
    X_train_encoded = autoencoder.encoder(X_train_tensor.to(device)).cpu().numpy()
    X_test_encoded = autoencoder.encoder(X_test_tensor.to(device)).cpu().numpy()

print(f"Original feature shape: {X_train.shape}")
print(f"Encoded feature shape: {X_train_encoded.shape}")
```

This should show that the number of features has been reduced from 64 to `latent_dim` (32 in our example).

Train Classifier on Extracted Features: Now, we train the same Logistic Regression classifier, but this time using `X_train_encoded` and `X_test_encoded`.

```python
# Train Logistic Regression model on encoded features (same settings as the baseline)
ae_feature_model = LogisticRegression(
    solver='liblinear', multi_class='ovr', random_state=42, max_iter=1000
)
ae_feature_model.fit(X_train_encoded, y_train)

# Evaluate model on encoded features
y_pred_ae_features = ae_feature_model.predict(X_test_encoded)
ae_features_accuracy = accuracy_score(y_test, y_pred_ae_features)
print(f"Logistic Regression with Autoencoder Features Accuracy: {ae_features_accuracy:.4f}")
```

Comparing Performance and Discussion

Let's say our classifier using autoencoder features achieves an accuracy of 0.9611.

Here's a simple comparison:

- Baseline Accuracy (64 features): 0.9556
- AE Features Accuracy (32 features): 0.9611

Figure: Classifier accuracy using original features versus features extracted by an autoencoder (bar chart; x-axis: feature set, y-axis: accuracy).

In this illustrative scenario, we achieved a slight improvement in accuracy while halving the number of features. This is a positive outcome. The autoencoder might have learned a more discriminative or noise-reduced representation of the data, which benefits the classifier.

Potential Outcomes and Considerations:

- Improved Performance: As seen above, the latent features might be more salient for the classification task. The autoencoder can act as a non-linear feature learner, capturing complex relationships that a linear model like Logistic Regression might miss on its own with raw features.
- Comparable Performance with Reduced Dimensionality: Even if the accuracy is similar (e.g., 0.9550 with AE features), achieving this with fewer features (32 vs. 64) is often valuable. It can lead to faster training times for the downstream classifier, reduced storage, and potentially better generalization if the original features had high redundancy.
- Worse Performance: If the accuracy drops significantly (e.g., to 0.9200), it could indicate:
  - Poor Autoencoder Training: The autoencoder might not have been trained sufficiently, or its architecture (including latent dimension) might not be optimal. The reconstruction loss should be reasonably low.
  - Information Loss: The chosen `latent_dim` might be too small, causing critical information for classification to be lost during compression.
  - Suboptimal Autoencoder Design: Perhaps a different type of autoencoder (e.g., a Denoising Autoencoder if the data is noisy, or a deeper architecture) would yield better features.
  - Hyperparameter Tuning: The learning rate, number of epochs, and batch size for the autoencoder, and the architecture itself (number of layers, neurons per layer), are all hyperparameters that might need tuning, as discussed earlier in this chapter.

Further Steps You Can Take:

- Vary Latent Dimensionality: Experiment with different sizes for `latent_dim`.
A very small latent dimension might lead to information loss, while a very large one might not offer much compression or feature learning benefit. Plotting classifier performance against latent dimension size can be very insightful.
- Different Classifiers: Try using the extracted features with other types of classifiers (e.g., Support Vector Machines, Random Forests, or even a small Multi-Layer Perceptron) to see if the benefits translate across different models.
- Evaluate Feature Quality: In addition to downstream task performance, you could visualize the latent space (if `latent_dim` is 2 or 3) to see whether digits of the same class cluster together.
- Advanced Autoencoders: For more complex datasets, especially images, you would typically use Convolutional Autoencoders. For noisy data, Denoising Autoencoders could be beneficial. If you suspect the latent space needs more structure, Variational Autoencoders (VAEs) could be explored, though their features (often the mean vector $\mu$) are used similarly.

This practice session demonstrates the end-to-end workflow of employing autoencoders for feature extraction in a supervised learning context. The important thing to remember is that the autoencoder is a tool; its effectiveness depends on careful design, training, and evaluation in the context of your specific problem and data. The features it extracts are not guaranteed to be better, but they offer a powerful way to transform your data, often leading to more compact and informative representations.
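
As a follow-up to the "Vary Latent Dimensionality" suggestion, the sketch below shows one possible way to sweep the latent size and record classifier accuracy for each value. It is a simplified illustration under several assumptions: `make_autoencoder` and `sweep_latent_dims` are hypothetical helper names, the network is smaller than the one used earlier, the scaler is fit on the training split only, and the classifier uses scikit-learn's default solver rather than the `liblinear` settings above, so the numbers it produces are not comparable to the accuracies quoted in this section.

```python
import torch
import torch.nn as nn
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler


def make_autoencoder(input_dim, latent_dim):
    """Build a small encoder/decoder pair (hypothetical helper)."""
    encoder = nn.Sequential(
        nn.Linear(input_dim, 64), nn.ReLU(),
        nn.Linear(64, latent_dim), nn.ReLU(),
    )
    decoder = nn.Sequential(
        nn.Linear(latent_dim, 64), nn.ReLU(),
        nn.Linear(64, input_dim), nn.Sigmoid(),
    )
    return encoder, decoder


def sweep_latent_dims(latent_dims, epochs=30, batch_size=32, seed=42):
    """Return {latent_dim: test accuracy} for a classifier on encoded features."""
    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    # Fit the scaler on the training split only
    scaler = MinMaxScaler().fit(X_train)
    X_train_t = torch.tensor(scaler.transform(X_train), dtype=torch.float32)
    X_test_t = torch.tensor(scaler.transform(X_test), dtype=torch.float32)

    results = {}
    for latent_dim in latent_dims:
        torch.manual_seed(seed)
        encoder, decoder = make_autoencoder(X_train_t.shape[1], latent_dim)
        params = list(encoder.parameters()) + list(decoder.parameters())
        optimizer = torch.optim.Adam(params, lr=1e-3)
        criterion = nn.MSELoss()

        # Train the autoencoder to reconstruct its input, one mini-batch at a time
        for _ in range(epochs):
            order = torch.randperm(X_train_t.shape[0])
            for start in range(0, len(order), batch_size):
                batch = X_train_t[order[start:start + batch_size]]
                optimizer.zero_grad()
                loss = criterion(decoder(encoder(batch)), batch)
                loss.backward()
                optimizer.step()

        # Encode both splits and fit a linear classifier on the codes
        with torch.no_grad():
            train_codes = encoder(X_train_t).numpy()
            test_codes = encoder(X_test_t).numpy()
        clf = LogisticRegression(max_iter=1000)
        clf.fit(train_codes, y_train)
        results[latent_dim] = clf.score(test_codes, y_test)
    return results
```

Calling `sweep_latent_dims([4, 8, 16, 32])` returns a dictionary that can be plotted as accuracy against latent dimension. Results will vary with the seed, the number of epochs, and the architecture, so it is worth repeating the sweep with a few seeds before drawing conclusions.
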