Construct an autoencoder to extract features from tabular data. This involves taking a dataset with existing features, training the autoencoder to reconstruct it, and then utilizing the autoencoder's compressed internal representation (the bottleneck layer) as a new, lower-dimensional set of features. The process requires careful consideration of data preparation, network design, loss functions, and optimizers.

We'll walk through the following steps:

1. Setting up our environment and generating a synthetic dataset.
2. Preparing the data: scaling and splitting.
3. Designing and building the autoencoder model.
4. Training the autoencoder.
5. Extracting the learned features from the bottleneck layer.
6. Visualizing the latent space.
7. Briefly evaluating the utility of these extracted features in a simple classification task.

This hands-on example will use PyTorch, but the principles are directly transferable to other frameworks.

## Setting Up and Generating Data

First, ensure you have the necessary libraries installed. We'll primarily use torch, numpy, pandas, and scikit-learn.

```python
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
```

For this exercise, we'll generate a synthetic dataset using scikit-learn. This gives us control over the data's characteristics and allows us to focus on the autoencoder mechanics. We'll create a dataset with 20 features.
Our autoencoder will try to learn a compressed representation of these features.

```python
from sklearn.datasets import make_classification

# Generate a synthetic dataset
X_orig, y_orig = make_classification(
    n_samples=2000,
    n_features=20,           # Original number of features
    n_informative=15,        # Number of informative features
    n_redundant=3,           # Number of redundant features
    n_repeated=0,            # Number of duplicated features
    n_classes=2,             # Number of classes for the target variable y
    n_clusters_per_class=2,
    flip_y=0.01,
    random_state=42
)

print(f"Original data shape: {X_orig.shape}")
# Original data shape: (2000, 20)
```

Here, `X_orig` contains our features, and `y_orig` is a binary class label. We'll use `X_orig` to train the autoencoder in an unsupervised manner (it won't see `y_orig` during training). Later, `y_orig` will be useful for visualizing the latent space and for the optional classification task.

## Data Preparation

Neural networks, including autoencoders, generally perform better when input features are on a similar scale. We'll use `MinMaxScaler` from scikit-learn to scale our features to a range between 0 and 1. The scaler is fit on the training split only, so that no test-set statistics leak into preprocessing; a few test values may therefore fall slightly outside [0, 1].

```python
# Split into training and test sets first, so the scaler never sees test data
X_train_raw, X_test_raw, y_train_np, y_test_np = train_test_split(
    X_orig, y_orig, test_size=0.2, random_state=42, stratify=y_orig
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = MinMaxScaler()
X_train_np = scaler.fit_transform(X_train_raw)
X_test_np = scaler.transform(X_test_raw)

# Convert NumPy arrays to PyTorch tensors
X_train_tensor = torch.tensor(X_train_np, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test_np, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train_np, dtype=torch.long)
y_test_tensor = torch.tensor(y_test_np, dtype=torch.long)

# Create TensorDatasets and DataLoaders
# The autoencoder learns to reconstruct its input, so X is both input and target.
batch_size = 32
train_dataset = TensorDataset(X_train_tensor, X_train_tensor)
test_dataset = TensorDataset(X_test_tensor, X_test_tensor)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

print(f"Training data shape (PyTorch tensor): {X_train_tensor.shape}")
print(f"Test data shape (PyTorch tensor): {X_test_tensor.shape}")
# Training data shape (PyTorch tensor): torch.Size([1600, 20])
# Test data shape (PyTorch tensor): torch.Size([400, 20])
```

Scaling ensures that all features contribute more evenly to the learning process and helps the optimizer converge faster. The Sigmoid activation function in the decoder's output layer is well-suited for data scaled to [0, 1].

## Designing the Autoencoder

We'll build a simple, fully-connected autoencoder. The architecture will consist of an encoder that compresses the input down to a small latent space, and a decoder that reconstructs the input from this latent representation.

Let's define the dimensionality of our bottleneck layer. This is a critical hyperparameter. A smaller bottleneck forces more aggressive compression and potentially more abstract feature learning.
For this example, let's aim for a 2-dimensional latent space, which will be easy to visualize.

```python
input_dim = X_train_tensor.shape[1]  # Number of features, which is 20
latent_dim = 2                       # Dimensionality of the bottleneck


class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, latent_dim),
            nn.ReLU()  # The bottleneck layer
        )
        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, input_dim),
            nn.Sigmoid()  # Sigmoid for [0,1] output
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded


autoencoder_model = Autoencoder(input_dim, latent_dim)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
autoencoder_model.to(device)

print(autoencoder_model)
```

The printout of `autoencoder_model` shows the structure: `encoder` as a Sequential block that compresses 20 dimensions down to 2, and `decoder` as another Sequential block that expands it back to 20. The ReLU activation is used in hidden layers, and Sigmoid in the output layer because our input data is scaled between 0 and 1. If we used `StandardScaler`, a linear activation might be more appropriate for the output.

Now, let's define the loss function and optimizer. We'll use the Adam optimizer and Mean Squared Error (`MSELoss`) as the loss function. MSE is suitable for reconstruction tasks with continuous (or scaled continuous) data.

```python
criterion = nn.MSELoss()
optimizer = optim.Adam(autoencoder_model.parameters(), lr=0.001)
```

## Training the Autoencoder

We train the autoencoder by asking it to reconstruct the training data.
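Written out, the reconstruction objective looks as follows. The symbols are notation introduced here for illustration: $x_{ij}$ is feature $j$ of row $i$ in a batch of $N$ rows with $d$ features, and $\hat{x}_{ij}$ is the corresponding decoder output. With its default `reduction='mean'`, `nn.MSELoss` averages the squared error over every element of the batch:

$$
\mathcal{L}_{\text{MSE}} = \frac{1}{N d} \sum_{i=1}^{N} \sum_{j=1}^{d} \left( x_{ij} - \hat{x}_{ij} \right)^2,
\qquad \hat{x}_i = \operatorname{decoder}\bigl(\operatorname{encoder}(x_i)\bigr).
$$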
The input and the target output are the same.

```python
epochs = 100
train_losses = []
val_losses = []

for epoch in range(epochs):
    # Training
    autoencoder_model.train()
    running_train_loss = 0.0
    for data, _ in train_loader:  # _ is the target, which is same as data
        data = data.to(device)

        optimizer.zero_grad()
        outputs = autoencoder_model(data)
        loss = criterion(outputs, data)
        loss.backward()
        optimizer.step()

        running_train_loss += loss.item() * data.size(0)

    epoch_train_loss = running_train_loss / len(train_loader.dataset)
    train_losses.append(epoch_train_loss)

    # Validation
    autoencoder_model.eval()
    running_val_loss = 0.0
    with torch.no_grad():
        for data, _ in test_loader:  # _ is the target, which is same as data
            data = data.to(device)
            outputs = autoencoder_model(data)
            loss = criterion(outputs, data)
            running_val_loss += loss.item() * data.size(0)

    epoch_val_loss = running_val_loss / len(test_loader.dataset)
    val_losses.append(epoch_val_loss)

    print(f'Epoch [{epoch+1}/{epochs}], Train Loss: {epoch_train_loss:.4f}, Val Loss: {epoch_val_loss:.4f}')
```

After training, we can visualize the training and validation loss to check for overfitting and to see if the model has converged.

[Figure: Autoencoder Training and Validation Loss. x-axis: Epoch; y-axis: Mean Squared Error (MSE); series: Training Loss, Validation Loss.]

Autoencoder training progress, showing decreasing MSE loss on both training and validation sets over epochs.
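A curve like the one described can be drawn directly from the per-epoch loss lists recorded in the training loop. The sketch below uses Matplotlib; `plot_loss_curves` is a hypothetical helper name, and the short lists passed to it are made-up stand-ins for `train_losses` and `val_losses`, not results from an actual run.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook or interactive session
import matplotlib.pyplot as plt


def plot_loss_curves(train_losses, val_losses):
    """Hypothetical helper: plot per-epoch training and validation loss on one axis."""
    epochs = range(1, len(train_losses) + 1)
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(epochs, train_losses, marker="o", label="Training Loss")
    ax.plot(epochs, val_losses, marker="o", label="Validation Loss")
    ax.set_title("Autoencoder Training and Validation Loss")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Mean Squared Error (MSE)")
    ax.legend()
    ax.grid(True)
    return fig, ax


# Made-up stand-in values; in the tutorial you would pass train_losses and val_losses.
demo_train = [0.120, 0.060, 0.045, 0.038, 0.035]
demo_val = [0.110, 0.058, 0.044, 0.037, 0.0345]
fig, ax = plot_loss_curves(demo_train, demo_val)
```

A validation curve that flattens or rises while the training curve keeps falling is the usual sign of overfitting to look for in such a plot.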
Ideally, both losses decrease and converge.

A good autoencoder will have a low reconstruction loss on the validation set.

## Extracting Features from the Bottleneck

Once the autoencoder is trained, the encoder part (`autoencoder_model.encoder`) can be used as a feature extractor. We feed our original (scaled) data through the encoder to obtain the compressed, latent representations.

```python
autoencoder_model.eval()  # Set model to evaluation mode
with torch.no_grad():     # Disable gradient calculations
    X_train_encoded = autoencoder_model.encoder(X_train_tensor.to(device)).cpu().numpy()
    X_test_encoded = autoencoder_model.encoder(X_test_tensor.to(device)).cpu().numpy()

print(f"Original training data shape: {X_train_np.shape}")
print(f"Encoded training data shape: {X_train_encoded.shape}")
# Original training data shape: (1600, 20)
# Encoded training data shape: (1600, 2)

print(f"Original test data shape: {X_test_np.shape}")
print(f"Encoded test data shape: {X_test_encoded.shape}")
# Original test data shape: (400, 20)
# Encoded test data shape: (400, 2)
```

As you can see, we've reduced the dimensionality of our data from 20 features down to 2. These 2 features are learned by the autoencoder and are trained to retain as much of the information needed to reconstruct the original data as a 2-dimensional code can hold.

## Visualizing the Latent Space

Since we chose `latent_dim = 2`, we can easily visualize the extracted features in a 2D scatter plot.
We can color the points using the original class labels (`y_train_np` or `y_test_np`) to see if the autoencoder has learned a representation that separates the classes, even though it wasn't trained explicitly for classification.

```python
plt.figure(figsize=(8, 6))
plt.scatter(X_test_encoded[y_test_np == 0, 0], X_test_encoded[y_test_np == 0, 1],
            label='Class 0', alpha=0.7, c='#4263eb')
plt.scatter(X_test_encoded[y_test_np == 1, 0], X_test_encoded[y_test_np == 1, 1],
            label='Class 1', alpha=0.7, c='#f06595')
plt.title('2D Latent Space Visualization')
plt.xlabel('Latent Dimension 1')
plt.ylabel('Latent Dimension 2')
plt.legend(title='Original Class')
plt.grid(True)
plt.show()
```

Scatter plot of the 2D latent space representations from the test set, colored by their original class labels. Clusters or separation might indicate meaningful feature learning. (Note: The data points above are illustrative placeholders.)

If the plot shows some separation or clustering based on the original classes, it suggests that the autoencoder has learned features that are relevant to the underlying structure of the data.

## Evaluating Feature Utility (A Quick Check)

The ultimate test of extracted features is their performance on a downstream task. Let's quickly train a simple Logistic Regression classifier on:

1. The original (scaled) 20 features.
2. The autoencoder's 2 extracted features.

We'll compare their accuracy on the test set.

```python
# 1. Classifier on original features
lr_original = LogisticRegression(solver='liblinear', random_state=42)
lr_original.fit(X_train_np, y_train_np)
y_pred_original = lr_original.predict(X_test_np)
accuracy_original = accuracy_score(y_test_np, y_pred_original)
print(f"Accuracy with original {X_train_np.shape[1]} features: {accuracy_original:.4f}")

# 2. Classifier on autoencoder-extracted features
lr_encoded = LogisticRegression(solver='liblinear', random_state=42)
lr_encoded.fit(X_train_encoded, y_train_np)  # Use the 2D encoded features
y_pred_encoded = lr_encoded.predict(X_test_encoded)
accuracy_encoded = accuracy_score(y_test_np, y_pred_encoded)
print(f"Accuracy with autoencoder's {X_train_encoded.shape[1]} features: {accuracy_encoded:.4f}")

# Example Output:
# Accuracy with original 20 features: 0.8850
# Accuracy with autoencoder's 2 features: 0.8725
```

In this output, the accuracy with 2 features is slightly lower than with 20 features. This is not unexpected, as some information might be lost during compression. However, achieving comparable performance with significantly fewer features (2 vs. 20) demonstrates the power of autoencoders for dimensionality reduction. In some cases, especially with noisy or highly redundant original features, autoencoder-extracted features can even lead to improved performance by acting as a denoising or regularization mechanism. The trade-off between dimensionality reduction and performance is a common theme.

## Summary of this Hands-on Session

In this hands-on section, we successfully:

- Generated a synthetic tabular dataset.
- Preprocessed the data using scaling and splitting.
- Designed, built, and trained a simple autoencoder with a 2-dimensional bottleneck in PyTorch.
- Used the trained encoder to extract low-dimensional features from our data.
- Visualized these features in the latent space.
- Briefly evaluated their utility by feeding them into a simple downstream classification model.

This practical exercise provides a foundational workflow for using autoencoders for feature extraction. You can adapt these steps to your own tabular datasets.
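One way to package the extraction step for reuse on another dataset is a small helper that runs any trained encoder over a NumPy array in batches. This is a sketch under assumptions: `extract_features` is a hypothetical name, the encoder shown is an untrained stand-in for a trained `autoencoder_model.encoder`, and the input is assumed to be already scaled with the same scaler that was used during training.

```python
import numpy as np
import torch
import torch.nn as nn


def extract_features(encoder: nn.Module, X: np.ndarray, batch_size: int = 256,
                     device: str = "cpu") -> np.ndarray:
    """Hypothetical helper: encode a 2D NumPy array in batches and return a NumPy array."""
    encoder = encoder.to(device)
    encoder.eval()            # evaluation mode (matters if the encoder has dropout or batch norm)
    chunks = []
    with torch.no_grad():     # no gradients are needed for inference
        for start in range(0, len(X), batch_size):
            batch = torch.as_tensor(X[start:start + batch_size],
                                    dtype=torch.float32, device=device)
            chunks.append(encoder(batch).cpu().numpy())
    return np.concatenate(chunks, axis=0)


# Untrained stand-in encoder with the same 20 -> 2 shape used in this section.
demo_encoder = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

# Pretend this is new data that has already been scaled.
X_new = np.random.default_rng(0).random((1000, 20))
Z_new = extract_features(demo_encoder, X_new)
print(Z_new.shape)  # (1000, 2)
```

Batching keeps memory use bounded when the new dataset is much larger than the 2,000 rows used here.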
Experimenting with the number of layers, neurons per layer, latent space dimensionality, and other hyperparameters (as discussed earlier in this chapter and explored further in Chapter 7) is crucial to achieving good results for your specific problem.

The next chapters will explore more advanced autoencoder architectures like Denoising Autoencoders, Convolutional Autoencoders for image data, and Variational Autoencoders, further expanding your toolkit for powerful feature extraction.
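As a starting point for the hyperparameter experimentation mentioned above, the sketch below trains a small autoencoder for several bottleneck sizes and records the final reconstruction error for each. It is a minimal illustration rather than this section's exact model: the helper names `build_autoencoder` and `final_reconstruction_mse`, the hidden width of 16, the linear (activation-free) bottleneck, the random stand-in data, and the short full-batch training budget are all assumptions chosen to keep the example fast.

```python
import numpy as np
import torch
import torch.nn as nn


def build_autoencoder(input_dim: int, latent_dim: int, hidden: int = 16) -> nn.Module:
    """Hypothetical helper: a small symmetric autoencoder with a linear bottleneck."""
    return nn.Sequential(
        nn.Linear(input_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, latent_dim),               # bottleneck (no activation)
        nn.Linear(latent_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, input_dim), nn.Sigmoid(),  # inputs are assumed to lie in [0, 1]
    )


def final_reconstruction_mse(X: torch.Tensor, latent_dim: int,
                             epochs: int = 50, seed: int = 0) -> float:
    """Train full-batch for a few epochs and return the final training MSE."""
    torch.manual_seed(seed)
    model = build_autoencoder(X.shape[1], latent_dim)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    criterion = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(X), X)  # the input is also the target
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        return criterion(model(X), X).item()


# Stand-in data in [0, 1]; replace with your own scaled training tensor.
rng = np.random.default_rng(0)
X_demo = torch.tensor(rng.random((200, 8)), dtype=torch.float32)

sweep_results = {k: final_reconstruction_mse(X_demo, k) for k in (1, 2, 4)}
for k, mse in sweep_results.items():
    print(f"latent_dim={k}: final MSE={mse:.4f}")
```

In a real comparison you would look at validation loss and downstream accuracy rather than training loss, and pick the smallest bottleneck whose error is acceptable for your task.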