Hands-on: Feature Extraction from Tabular Data

Alright, let's build an autoencoder to extract features from tabular data. In the previous sections, we've discussed the theory: data preparation, network design, loss functions, and optimizers. Now, we'll put all that into practice. Our goal is to take a dataset with a certain number of features, train an autoencoder to reconstruct it, and then use the autoencoder's compressed internal representation (the bottleneck layer) as a new, lower-dimensional set of features.

We'll walk through the following steps:

1. Setting up our environment and generating a synthetic dataset.
2. Preparing the data: scaling and splitting.
3. Designing and building the autoencoder model.
4. Training the autoencoder.
5. Extracting the learned features from the bottleneck layer.
6. Visualizing the latent space.
7. Briefly evaluating the utility of these extracted features in a simple classification task.

This hands-on example will use PyTorch, but the principles are directly transferable to other frameworks.

Setting Up and Generating Data

First, ensure you have the necessary libraries installed. We'll use torch, numpy, scikit-learn, and matplotlib.

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

For this exercise, we'll generate a synthetic dataset using scikit-learn. This gives us control over the data's characteristics and allows us to focus on the autoencoder mechanics. We'll create a dataset with 20 features. Our autoencoder will try to learn a compressed representation of these features.

from sklearn.datasets import make_classification

# Generate a synthetic dataset
X_orig, y_orig = make_classification(
    n_samples=2000,
    n_features=20,        # Original number of features
    n_informative=15,    # Number of informative features
    n_redundant=3,        # Number of redundant features
    n_repeated=0,        # Number of duplicated features
    n_classes=2,        # Number of classes for the target variable y
    n_clusters_per_class=2,
    flip_y=0.01,
    random_state=42
)

print(f"Original data shape: {X_orig.shape}")
# Original data shape: (2000, 20)

Here, X_orig contains our features, and y_orig is a binary class label. We'll use X_orig to train the autoencoder in an unsupervised manner (it won't see y_orig during training). Later, y_orig will be useful for visualizing the latent space and for the optional classification task.

Data Preparation

Neural networks, including autoencoders, generally perform better when input features are on a similar scale. We'll use MinMaxScaler from scikit-learn to scale our features to a range between 0 and 1.

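# Split first so the scaler never sees the test set.
# The autoencoder learns to reconstruct its input, so X is both input and target.
X_train_raw, X_test_raw, y_train_np, y_test_np = train_test_split(
    X_orig, y_orig, test_size=0.2, random_state=42, stratify=y_orig
)

# Fit the scaler on the training split only, then apply it to both splits.
# Test values can fall slightly outside [0, 1] if they exceed the training range.
scaler = MinMaxScaler()
X_train_np = scaler.fit_transform(X_train_raw)
X_test_np = scaler.transform(X_test_raw)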

# Convert NumPy arrays to PyTorch tensors
X_train_tensor = torch.tensor(X_train_np, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test_np, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train_np, dtype=torch.long)
y_test_tensor = torch.tensor(y_test_np, dtype=torch.long)

# Create TensorDatasets and DataLoaders
batch_size = 32
train_dataset = TensorDataset(X_train_tensor, X_train_tensor) # Input and target are the same for AE
test_dataset = TensorDataset(X_test_tensor, X_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


print(f"Training data shape (PyTorch tensor): {X_train_tensor.shape}")
print(f"Test data shape (PyTorch tensor): {X_test_tensor.shape}")
# Training data shape (PyTorch tensor): torch.Size([1600, 20])
# Test data shape (PyTorch tensor): torch.Size([400, 20])

Scaling ensures that all features contribute more evenly to the learning process and helps the optimizer converge faster. The Sigmoid activation function in the decoder's output layer is well-suited for data scaled to [0, 1].
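To make the scaling step concrete, here is a minimal sketch of the min-max formula, x' = (x - min) / (max - min), applied per column with plain NumPy. The small array `X_demo` and the sample `new_sample` are made up for illustration. The last lines show that a sample scaled with previously computed statistics can land outside [0, 1], which a Sigmoid output cannot reproduce exactly.

```python
import numpy as np

# Hypothetical toy data: 3 samples, 2 features on very different scales
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [4.0, 800.0]])

# Per-column statistics, as a scaler would learn them during fitting
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)

# Min-max scaling: every column now spans exactly [0, 1]
X_demo_scaled = (X_demo - col_min) / (col_max - col_min)
# rows are approximately [0, 0], [1/3, 1/3], [1, 1]

# A new sample scaled with the same statistics may leave [0, 1]
new_sample = np.array([5.0, 100.0])
new_scaled = (new_sample - col_min) / (col_max - col_min)
# approximately [1.333, -0.167]
```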

Designing the Autoencoder

We'll build a simple, fully-connected autoencoder. The architecture will consist of an encoder that compresses the input down to a small latent space, and a decoder that reconstructs the input from this latent representation.

Let's define the dimensionality of our bottleneck layer. This is a critical hyperparameter. A smaller bottleneck forces more aggressive compression and potentially more abstract feature learning. For this example, let's aim for a 2-dimensional latent space, which will be easy to visualize.

input_dim = X_train_tensor.shape[1]  # Number of features, which is 20
latent_dim = 2                      # Dimensionality of the bottleneck

class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super(Autoencoder, self).__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, latent_dim),  # The bottleneck layer
            nn.ReLU() # Activation on the bottleneck; latent values are non-negative
        )
        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, input_dim),
            nn.Sigmoid() # Sigmoid for [0,1] output
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

autoencoder_model = Autoencoder(input_dim, latent_dim)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
autoencoder_model.to(device)

print(autoencoder_model)

The printout of autoencoder_model shows the structure: the encoder is a Sequential block that compresses 20 dimensions down to 2, and the decoder is another Sequential block that expands them back to 20. ReLU follows every hidden layer, including the bottleneck, so the latent values are non-negative. The output layer uses Sigmoid because our input data is scaled between 0 and 1. If we used StandardScaler, whose outputs are unbounded, a linear output layer (no activation) would be the more appropriate choice.
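As a quick sanity check on the model's size, the sketch below counts the trainable parameters implied by the layer widths used above (20, 128, 64, 2, 64, 128, 20). It relies only on the fact that a fully connected layer with a bias has n_in * n_out weights plus n_out biases. The helper name `linear_params` is ours, not part of PyTorch.

```python
# Layer widths of the autoencoder: encoder 20 -> 128 -> 64 -> 2, decoder mirrored
layer_sizes = [20, 128, 64, 2, 64, 128, 20]

def linear_params(n_in, n_out):
    """Parameters of one fully connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

# Pair each layer width with the next one to get every Linear layer
per_layer = [linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)     # [2688, 8256, 130, 192, 8320, 2580]
print(total_params)  # 22166
```

The same total should be reported by sum(p.numel() for p in autoencoder_model.parameters()), since the activations add no parameters.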

Now, let's define the loss function and optimizer. We'll use the Adam optimizer and Mean Squared Error (MSELoss) as the loss function. MSE is suitable for reconstruction tasks with continuous (or scaled continuous) data.

criterion = nn.MSELoss()
optimizer = optim.Adam(autoencoder_model.parameters(), lr=0.001)
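To see what the loss measures, here is a small pure-Python sketch of mean squared error over a batch. With its default settings, nn.MSELoss averages the squared error over every element in the batch, which is what this hypothetical `mse` helper does. The two tiny "batches" are made-up values.

```python
def mse(reconstruction, target):
    """Mean of squared differences over every element of a 2D batch."""
    total = 0.0
    count = 0
    for recon_row, target_row in zip(reconstruction, target):
        for r, t in zip(recon_row, target_row):
            total += (r - t) ** 2
            count += 1
    return total / count

# Illustrative batch: 2 samples with 3 scaled features each
target_demo = [[0.0, 0.5, 1.0],
               [0.2, 0.4, 0.6]]
recon_demo = [[0.1, 0.5, 0.8],
              [0.2, 0.5, 0.6]]

# Squared errors are 0.01, 0, 0.04, 0, 0.01, 0, so the mean is about 0.01
loss_demo = mse(recon_demo, target_demo)
```

Because the loss is a per-batch mean, a training loop that wants a dataset-level average typically multiplies each batch loss by the batch size and divides the running sum by the dataset size.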

Training the Autoencoder

We train the autoencoder by asking it to reconstruct the training data. The input and the target output are the same.

epochs = 100
train_losses = []
val_losses = []

for epoch in range(epochs):
    # Training
    autoencoder_model.train()
    running_train_loss = 0.0
    for data, _ in train_loader: # _ is the target, which is same as data
        data = data.to(device)
        optimizer.zero_grad()
        outputs = autoencoder_model(data)
        loss = criterion(outputs, data)
        loss.backward()
        optimizer.step()
        running_train_loss += loss.item() * data.size(0)
    
    epoch_train_loss = running_train_loss / len(train_loader.dataset)
    train_losses.append(epoch_train_loss)

    # Validation
    autoencoder_model.eval()
    running_val_loss = 0.0
    with torch.no_grad():
        for data, _ in test_loader: # _ is the target, which is same as data
            data = data.to(device)
            outputs = autoencoder_model(data)
            loss = criterion(outputs, data)
            running_val_loss += loss.item() * data.size(0)
    
    epoch_val_loss = running_val_loss / len(test_loader.dataset)
    val_losses.append(epoch_val_loss)

    print(f'Epoch [{epoch+1}/{epochs}], Train Loss: {epoch_train_loss:.4f}, Val Loss: {epoch_val_loss:.4f}')

After training, we can visualize the training and validation loss to check for overfitting and to see if the model has converged.

Autoencoder training progress, showing decreasing MSE loss on both training and validation sets over epochs. Ideally, both losses decrease and converge.

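A well-trained autoencoder reaches a low reconstruction loss on the validation set that stays close to the training loss; a widening gap between the two curves signals overfitting. Note that this example reuses the test set as the validation set to keep the code short.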
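One way to draw these curves is sketched below. The helper `plot_loss_curves` is a hypothetical name, and the demo values are placeholders so the snippet runs on its own; in practice you would pass the `train_losses` and `val_losses` lists collected in the training loop.

```python
import matplotlib.pyplot as plt

def plot_loss_curves(train_losses, val_losses):
    """Plot per-epoch training and validation loss on shared axes."""
    epochs_axis = range(1, len(train_losses) + 1)
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.plot(epochs_axis, train_losses, label="Train loss")
    ax.plot(epochs_axis, val_losses, label="Validation loss")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("MSE loss")
    ax.set_title("Autoencoder training progress")
    ax.legend()
    ax.grid(True)
    return fig, ax

# Placeholder values for illustration only, not real training results
demo_train = [0.050, 0.040, 0.034, 0.031]
demo_val = [0.052, 0.043, 0.037, 0.035]
fig, ax = plot_loss_curves(demo_train, demo_val)
# Call plt.show() to display the figure in an interactive session
```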

Extracting Features from the Bottleneck

Once the autoencoder is trained, the encoder part (autoencoder_model.encoder) can be used as a feature extractor. We feed our original (scaled) data through the encoder to obtain the compressed, latent representations.

autoencoder_model.eval() # Set model to evaluation mode
with torch.no_grad(): # Disable gradient calculations
    X_train_encoded = autoencoder_model.encoder(X_train_tensor.to(device)).cpu().numpy()
    X_test_encoded = autoencoder_model.encoder(X_test_tensor.to(device)).cpu().numpy()

print(f"Original training data shape: {X_train_np.shape}")
print(f"Encoded training data shape: {X_train_encoded.shape}")
# Original training data shape: (1600, 20)
# Encoded training data shape: (1600, 2)

print(f"Original test data shape: {X_test_np.shape}")
print(f"Encoded test data shape: {X_test_encoded.shape}")
# Original test data shape: (400, 20)
# Encoded test data shape: (400, 2)

As you can see, we've successfully reduced the dimensionality of our data from 20 features down to 2 features. These 2 features are learned by the autoencoder and capture the most salient information needed to reconstruct the original data.
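For datasets too large to push through the encoder in a single call, the same extraction can be done in batches. The sketch below is one possible approach: `extract_features` is a hypothetical helper, and `toy_encoder` is an untrained stand-in so the snippet is self-contained. With the model above you would pass autoencoder_model.encoder instead.

```python
import torch
import torch.nn as nn

def extract_features(encoder, X, device="cpu", batch_size=256):
    """Run X through the encoder in batches and return a NumPy array."""
    encoder.eval()
    chunks = []
    with torch.no_grad():  # No gradients are needed for inference
        for start in range(0, X.shape[0], batch_size):
            batch = X[start:start + batch_size].to(device)
            chunks.append(encoder(batch).cpu())
    return torch.cat(chunks).numpy()

# Untrained stand-in encoder and random data, for illustration only
toy_encoder = nn.Sequential(nn.Linear(20, 8), nn.ReLU(), nn.Linear(8, 2))
X_demo = torch.rand(1000, 20)

features = extract_features(toy_encoder, X_demo)
print(features.shape)  # (1000, 2)
```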

Visualizing the Latent Space

Since we chose latent_dim = 2, we can easily visualize the extracted features in a 2D scatter plot. We can color the points using the original class labels (y_train_np or y_test_np) to see if the autoencoder has learned a representation that separates the classes, even though it wasn't trained explicitly for classification.

plt.figure(figsize=(8, 6))
plt.scatter(X_test_encoded[y_test_np == 0, 0], X_test_encoded[y_test_np == 0, 1],
            label='Class 0', alpha=0.7, c='#4263eb')
plt.scatter(X_test_encoded[y_test_np == 1, 0], X_test_encoded[y_test_np == 1, 1],
            label='Class 1', alpha=0.7, c='#f06595')
plt.title('2D Latent Space Visualization')
plt.xlabel('Latent Dimension 1')
plt.ylabel('Latent Dimension 2')
plt.legend(title='Original Class')
plt.grid(True)
plt.show()

Scatter plot of the 2D latent space representations from the test set, colored by their original class labels. Clusters or separation might indicate meaningful feature learning. (Note: The data points above are illustrative placeholders.)

If the plot shows some separation or clustering based on the original classes, it suggests that the autoencoder has learned features that are relevant to the underlying structure of the data.
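A scatter plot is a visual judgment, so it can help to attach a number to it. One option, which is our suggestion rather than part of the workflow above, is scikit-learn's silhouette_score computed on the latent coordinates with the class labels as cluster assignments. The sketch uses two synthetic, well-separated blobs in place of real encoder output. With the data above you would pass X_test_encoded and y_test_np.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a 2D latent space with two well-separated classes
rng = np.random.default_rng(0)
class0 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
class1 = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2))

latent_demo = np.vstack([class0, class1])
labels_demo = np.array([0] * 100 + [1] * 100)

# Ranges from -1 to 1; values near 1 mean tight, well-separated groups,
# values near 0 mean the classes overlap in the latent space
score = silhouette_score(latent_demo, labels_demo)
print(f"Silhouette score: {score:.3f}")
```

A low score does not by itself mean the features are useless, since classes can be separable by a classifier without forming compact clusters.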

Evaluating Feature Utility (A Quick Check)

The ultimate test of extracted features is their performance on a downstream task. Let's quickly train a simple Logistic Regression classifier on:

1. The original (scaled) 20 features.
2. The autoencoder's 2 extracted features.

We'll compare their accuracy on the test set.

# 1. Classifier on original features
lr_original = LogisticRegression(solver='liblinear', random_state=42)
lr_original.fit(X_train_np, y_train_np)
y_pred_original = lr_original.predict(X_test_np)
accuracy_original = accuracy_score(y_test_np, y_pred_original)
print(f"Accuracy with original {X_train_np.shape[1]} features: {accuracy_original:.4f}")

# 2. Classifier on autoencoder-extracted features
lr_encoded = LogisticRegression(solver='liblinear', random_state=42)
lr_encoded.fit(X_train_encoded, y_train_np) # Use the 2D encoded features
y_pred_encoded = lr_encoded.predict(X_test_encoded)
accuracy_encoded = accuracy_score(y_test_np, y_pred_encoded)
print(f"Accuracy with autoencoder's {X_train_encoded.shape[1]} features: {accuracy_encoded:.4f}")

# Example Output:
# Accuracy with original 20 features: 0.8850
# Accuracy with autoencoder's 2 features: 0.8725

In this output, the accuracy with 2 features is slightly lower than with 20 features. This is not unexpected, as some information might be lost during compression. However, achieving comparable performance with significantly fewer features (2 vs. 20) demonstrates the power of autoencoders for dimensionality reduction. In some cases, especially with noisy or highly redundant original features, autoencoder-extracted features can even lead to improved performance by acting as a denoising or regularization mechanism. The trade-off between dimensionality reduction and performance is a common theme.

Summary of this Hands-on Session

In this hands-on section, we successfully:

Generated a synthetic tabular dataset.
Preprocessed the data using scaling and splitting.
Designed, built, and trained a simple autoencoder with a 2-dimensional bottleneck in PyTorch.
Used the trained encoder to extract low-dimensional features from our data.
Visualized these features in the latent space.
Briefly evaluated their utility by feeding them into a simple downstream classification model.

This practical exercise provides a foundational workflow for using autoencoders for feature extraction. You can adapt these steps to your own tabular datasets. Experimenting with the number of layers, neurons per layer, latent space dimensionality, and other hyperparameters (discussed earlier in this chapter and explored further in Chapter 7) is key to achieving good results for your specific problem.

The next chapters will explore more advanced autoencoder architectures like Denoising Autoencoders, Convolutional Autoencoders for image data, and Variational Autoencoders, further expanding your toolkit for powerful feature extraction.
