Alright, let's build an autoencoder to extract features from tabular data. In the previous sections, we've discussed the theory: data preparation, network design, loss functions, and optimizers. Now, we'll put all that into practice. Our goal is to take a dataset with a certain number of features, train an autoencoder to reconstruct it, and then use the autoencoder's compressed internal representation (the bottleneck layer) as a new, lower-dimensional set of features.
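We'll walk through the following steps: setting up the environment, generating and preprocessing a dataset, building and training the autoencoder, extracting and visualizing the latent features, and evaluating those features on a downstream classification task.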
This hands-on example will use PyTorch, but the principles are directly transferable to other frameworks.
First, ensure you have the necessary libraries installed. We'll primarily use torch, numpy, pandas, and scikit-learn, plus matplotlib for plotting.
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
For this exercise, we'll generate a synthetic dataset using scikit-learn. This gives us control over the data's characteristics and allows us to focus on the autoencoder mechanics. We'll create a dataset with 20 features. Our autoencoder will try to learn a compressed representation of these features.
from sklearn.datasets import make_classification
# Generate a synthetic dataset
X_orig, y_orig = make_classification(
    n_samples=2000,
    n_features=20,           # Original number of features
    n_informative=15,        # Number of informative features
    n_redundant=3,           # Number of redundant features
    n_repeated=0,            # Number of duplicated features
    n_classes=2,             # Number of classes for the target variable y
    n_clusters_per_class=2,
    flip_y=0.01,
    random_state=42
)
print(f"Original data shape: {X_orig.shape}")
# Original data shape: (2000, 20)
Here, X_orig contains our features, and y_orig is a binary class label. We'll use X_orig to train the autoencoder in an unsupervised manner (it won't see y_orig during training). Later, y_orig will be useful for visualizing the latent space and for the optional classification task.
Neural networks, including autoencoders, generally perform better when input features are on a similar scale. We'll use MinMaxScaler from scikit-learn to scale our features to a range between 0 and 1.
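# Split data into training and testing sets first, so the scaler only sees training data
# The autoencoder learns to reconstruct its input, so X is both input and target.
X_train_raw, X_test_raw, y_train_np, y_test_np = train_test_split(
    X_orig, y_orig, test_size=0.2, random_state=42, stratify=y_orig
)
# Fit the scaler on the training split only to avoid leaking test-set statistics
scaler = MinMaxScaler()
X_train_np = scaler.fit_transform(X_train_raw)
# Test values can fall slightly outside [0, 1] if they exceed the training min/max
X_test_np = scaler.transform(X_test_raw)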
# Convert NumPy arrays to PyTorch tensors
X_train_tensor = torch.tensor(X_train_np, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test_np, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train_np, dtype=torch.long)
y_test_tensor = torch.tensor(y_test_np, dtype=torch.long)
# Create TensorDatasets and DataLoaders
batch_size = 32
train_dataset = TensorDataset(X_train_tensor, X_train_tensor) # Input and target are the same for AE
test_dataset = TensorDataset(X_test_tensor, X_test_tensor)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
print(f"Training data shape (PyTorch tensor): {X_train_tensor.shape}")
print(f"Test data shape (PyTorch tensor): {X_test_tensor.shape}")
# Training data shape (PyTorch tensor): torch.Size([1600, 20])
# Test data shape (PyTorch tensor): torch.Size([400, 20])
Scaling ensures that all features contribute more evenly to the learning process and helps the optimizer converge faster. The Sigmoid activation function in the decoder's output layer is well-suited for data scaled to [0, 1].
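To make the scaling step concrete, here is a minimal sketch of what MinMaxScaler computes for each feature, x' = (x - min) / (max - min), checked against scikit-learn on a tiny array. The array values and the names X_demo, X_manual, and X_sklearn are made up for illustration and are not part of the dataset above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up matrix: 3 samples, 2 features (illustrative values only)
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [4.0, 40.0]])

# Manual min-max scaling, computed column by column
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_manual = (X_demo - col_min) / (col_max - col_min)

# The same transformation via scikit-learn
X_sklearn = MinMaxScaler().fit_transform(X_demo)

print(np.allclose(X_manual, X_sklearn))  # True
print(X_manual)
```

Each column is mapped so that its smallest value becomes 0 and its largest becomes 1, which is the range a Sigmoid output can reproduce.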
We'll build a simple, fully-connected autoencoder. The architecture will consist of an encoder that compresses the input down to a small latent space, and a decoder that reconstructs the input from this latent representation.
Let's define the dimensionality of our bottleneck layer. This is a critical hyperparameter. A smaller bottleneck forces more aggressive compression and potentially more abstract feature learning. For this example, let's aim for a 2-dimensional latent space, which will be easy to visualize.
input_dim = X_train_tensor.shape[1] # Number of features, which is 20
latent_dim = 2 # Dimensionality of the bottleneck
class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, latent_dim),
            nn.ReLU()  # The bottleneck layer
        )
        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, input_dim),
            nn.Sigmoid()  # Sigmoid for [0,1] output
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

autoencoder_model = Autoencoder(input_dim, latent_dim)
# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
autoencoder_model.to(device)
print(autoencoder_model)
The printout of autoencoder_model shows the structure: encoder as a Sequential block that compresses 20 dimensions down to 2, and decoder as another Sequential block that expands it back to 20. The ReLU activation is used in hidden layers, and Sigmoid in the output layer because our input data is scaled between 0 and 1. If we used StandardScaler instead, the inputs would no longer be bounded, so a linear output (no activation) would be more appropriate.
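For comparison, here is a minimal sketch of that StandardScaler variant. It assumes the same layer sizes as the model above and uses a hypothetical class name, LinearOutputAutoencoder. The only change is that the decoder's final Sigmoid is dropped.

```python
import torch
import torch.nn as nn


class LinearOutputAutoencoder(nn.Module):
    """Hypothetical variant for standardized (unbounded) inputs."""

    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, latent_dim),
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, input_dim),  # no activation: outputs can be any real value
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


demo_model = LinearOutputAutoencoder(input_dim=20, latent_dim=2)
x_demo = torch.randn(4, 20)            # stand-in for a batch of standardized rows
print(demo_model(x_demo).shape)        # torch.Size([4, 20])
print(demo_model.encoder(x_demo).shape)  # torch.Size([4, 2])
```

MSE remains a reasonable loss for this variant, since the targets are still continuous values.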
Now, let's define the loss function and optimizer. We'll use the Adam optimizer and Mean Squared Error (MSELoss) as the loss function. MSE is suitable for reconstruction tasks with continuous (or scaled continuous) data.
criterion = nn.MSELoss()
optimizer = optim.Adam(autoencoder_model.parameters(), lr=0.001)
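As a quick check on what this loss measures, the sketch below compares nn.MSELoss with the same quantity written out by hand. The two small tensors are made-up values standing in for an input batch and its reconstruction.

```python
import torch
import torch.nn as nn

# Made-up "input" and "reconstruction" for 2 samples with 3 features each
target = torch.tensor([[0.0, 0.5, 1.0],
                       [0.2, 0.4, 0.6]])
recon = torch.tensor([[0.1, 0.5, 0.8],
                      [0.2, 0.6, 0.6]])

# nn.MSELoss with the default reduction="mean" averages over every element
loss_builtin = nn.MSELoss()(recon, target)

# The same quantity written out explicitly
loss_manual = ((recon - target) ** 2).mean()

print(loss_builtin.item(), loss_manual.item())  # both approximately 0.015
```

Because the loss is a mean over the batch, the training loop in this section multiplies loss.item() by data.size(0) before summing and then divides by the dataset size, so a smaller final batch does not skew the epoch average.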
We train the autoencoder by asking it to reconstruct the training data. The input and the target output are the same.
epochs = 100
train_losses = []
val_losses = []
for epoch in range(epochs):
    # Training
    autoencoder_model.train()
    running_train_loss = 0.0
    for data, _ in train_loader:  # _ is the target, which is same as data
        data = data.to(device)
        optimizer.zero_grad()
        outputs = autoencoder_model(data)
        loss = criterion(outputs, data)
        loss.backward()
        optimizer.step()
        running_train_loss += loss.item() * data.size(0)
    epoch_train_loss = running_train_loss / len(train_loader.dataset)
    train_losses.append(epoch_train_loss)

    # Validation
    autoencoder_model.eval()
    running_val_loss = 0.0
    with torch.no_grad():
        for data, _ in test_loader:  # _ is the target, which is same as data
            data = data.to(device)
            outputs = autoencoder_model(data)
            loss = criterion(outputs, data)
            running_val_loss += loss.item() * data.size(0)
    epoch_val_loss = running_val_loss / len(test_loader.dataset)
    val_losses.append(epoch_val_loss)

    print(f'Epoch [{epoch+1}/{epochs}], Train Loss: {epoch_train_loss:.4f}, Val Loss: {epoch_val_loss:.4f}')
After training, we can visualize the training and validation loss to check for overfitting and to see if the model has converged.
Autoencoder training progress, showing decreasing MSE loss on both training and validation sets over epochs. Ideally, both losses decrease and converge.
A good autoencoder will have a low reconstruction loss on the validation set.
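One way to draw such a loss plot is sketched below. The helper name plot_loss_curves is hypothetical. It assumes two equal-length lists of per-epoch losses, such as the train_losses and val_losses collected in the training loop.

```python
import matplotlib.pyplot as plt


def plot_loss_curves(train_losses, val_losses):
    """Plot per-epoch training and validation loss on shared axes."""
    epochs_axis = range(1, len(train_losses) + 1)
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.plot(epochs_axis, train_losses, label="Training loss")
    ax.plot(epochs_axis, val_losses, label="Validation loss")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("MSE loss")
    ax.set_title("Autoencoder training progress")
    ax.legend()
    ax.grid(True)
    return fig, ax


# Usage with the lists collected during training:
# fig, ax = plot_loss_curves(train_losses, val_losses)
# plt.show()
```

A validation curve that flattens or rises while the training curve keeps falling is the usual sign of overfitting.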
Once the autoencoder is trained, the encoder part (autoencoder_model.encoder) can be used as a feature extractor. We feed our original (scaled) data through the encoder to obtain the compressed, latent representations.
autoencoder_model.eval()  # Set model to evaluation mode
with torch.no_grad():  # Disable gradient calculations
    X_train_encoded = autoencoder_model.encoder(X_train_tensor.to(device)).cpu().numpy()
    X_test_encoded = autoencoder_model.encoder(X_test_tensor.to(device)).cpu().numpy()
print(f"Original training data shape: {X_train_np.shape}")
print(f"Encoded training data shape: {X_train_encoded.shape}")
# Original training data shape: (1600, 20)
# Encoded training data shape: (1600, 2)
print(f"Original test data shape: {X_test_np.shape}")
print(f"Encoded test data shape: {X_test_encoded.shape}")
# Original test data shape: (400, 20)
# Encoded test data shape: (400, 2)
As you can see, we've reduced the dimensionality of our data from 20 features down to 2. These 2 features are learned by the autoencoder and retain as much of the information needed to reconstruct the original data as the bottleneck allows.
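Passing the whole array through the encoder at once works for a dataset of this size. For larger tables, the same extraction can be done in batches. The helper below is a minimal sketch with a hypothetical name, extract_features. The toy encoder and random data are stand-ins so the block runs on its own.

```python
import numpy as np
import torch
import torch.nn as nn


def extract_features(encoder, X, batch_size=256, device="cpu"):
    """Run a NumPy array through an encoder in batches and return a NumPy array."""
    encoder = encoder.to(device)
    encoder.eval()
    chunks = []
    with torch.no_grad():
        for start in range(0, len(X), batch_size):
            batch = torch.as_tensor(X[start:start + batch_size], dtype=torch.float32)
            chunks.append(encoder(batch.to(device)).cpu().numpy())
    return np.concatenate(chunks, axis=0)


# Stand-in encoder and data (not the trained model from this section)
toy_encoder = nn.Sequential(nn.Linear(20, 2), nn.ReLU())
X_toy = np.random.rand(1000, 20)
Z_toy = extract_features(toy_encoder, X_toy, batch_size=300)
print(Z_toy.shape)  # (1000, 2)
```

With the trained model, the call would look like extract_features(autoencoder_model.encoder, X_train_np, device=device).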
Since we chose latent_dim = 2, we can easily visualize the extracted features in a 2D scatter plot. We can color the points using the original class labels (y_train_np or y_test_np) to see if the autoencoder has learned a representation that separates the classes, even though it wasn't trained explicitly for classification.
plt.figure(figsize=(8, 6))
plt.scatter(X_test_encoded[y_test_np == 0, 0], X_test_encoded[y_test_np == 0, 1],
            label='Class 0', alpha=0.7, c='#4263eb')
plt.scatter(X_test_encoded[y_test_np == 1, 0], X_test_encoded[y_test_np == 1, 1],
            label='Class 1', alpha=0.7, c='#f06595')
plt.title('2D Latent Space Visualization')
plt.xlabel('Latent Dimension 1')
plt.ylabel('Latent Dimension 2')
plt.legend(title='Original Class')
plt.grid(True)
plt.show()
Scatter plot of the 2D latent space representations from the test set, colored by their original class labels. Clusters or separation might indicate meaningful feature learning. (Note: The data points above are illustrative placeholders.)
If the plot shows some separation or clustering based on the original classes, it suggests that the autoencoder has learned features that are relevant to the underlying structure of the data.
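If the points instead collapse onto a line or a single axis, one possible cause is worth checking. The bottleneck in the model above ends with a ReLU, so a latent unit can end up outputting zero for every sample and carry no information. The sketch below flags such columns. The function name constant_latent_dims and the demo matrix are illustrative assumptions.

```python
import numpy as np


def constant_latent_dims(Z, tol=1e-8):
    """Return indices of latent columns whose variance is (near) zero."""
    Z = np.asarray(Z)
    return np.where(Z.var(axis=0) < tol)[0].tolist()


# Made-up encoded matrix: column 1 is stuck at zero
Z_demo = np.array([[0.3, 0.0],
                   [1.2, 0.0],
                   [0.7, 0.0]])
print(constant_latent_dims(Z_demo))  # [1]
```

Calling constant_latent_dims(X_train_encoded) on the real features returns an empty list when every latent dimension varies across samples.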
The ultimate test of extracted features is their performance on a downstream task. Let's quickly train a simple Logistic Regression classifier on (1) the original 20 scaled features and (2) the 2 features extracted by the autoencoder. We'll compare their accuracy on the test set.
# 1. Classifier on original features
lr_original = LogisticRegression(solver='liblinear', random_state=42)
lr_original.fit(X_train_np, y_train_np)
y_pred_original = lr_original.predict(X_test_np)
accuracy_original = accuracy_score(y_test_np, y_pred_original)
print(f"Accuracy with original {X_train_np.shape[1]} features: {accuracy_original:.4f}")
# 2. Classifier on autoencoder-extracted features
lr_encoded = LogisticRegression(solver='liblinear', random_state=42)
lr_encoded.fit(X_train_encoded, y_train_np) # Use the 2D encoded features
y_pred_encoded = lr_encoded.predict(X_test_encoded)
accuracy_encoded = accuracy_score(y_test_np, y_pred_encoded)
print(f"Accuracy with autoencoder's {X_train_encoded.shape[1]} features: {accuracy_encoded:.4f}")
# Example Output:
# Accuracy with original 20 features: 0.8850
# Accuracy with autoencoder's 2 features: 0.8725
In this output, the accuracy with 2 features is slightly lower than with 20 features. This is not unexpected, as some information might be lost during compression. However, achieving comparable performance with significantly fewer features (2 vs. 20) demonstrates the power of autoencoders for dimensionality reduction. In some cases, especially with noisy or highly redundant original features, autoencoder-extracted features can even lead to improved performance by acting as a denoising or regularization mechanism. The trade-off between dimensionality reduction and performance is a common theme.
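To put the 2-feature result in context, it can help to compare it with a simple linear reduction to the same number of dimensions. The sketch below is an assumption about how such a baseline might look, not part of the original workflow. It regenerates the same synthetic data, projects it onto 2 PCA components, and trains the same classifier. The variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Same synthetic data and split settings as earlier in this section
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=15, n_redundant=3,
    n_repeated=0, n_classes=2, n_clusters_per_class=2, flip_y=0.01,
    random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale using statistics from the training split only
scaler = MinMaxScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# Linear 2D projection as a baseline for the 2D autoencoder features
pca = PCA(n_components=2, random_state=42).fit(X_tr)
X_tr_pca, X_te_pca = pca.transform(X_tr), pca.transform(X_te)

clf = LogisticRegression(solver="liblinear", random_state=42)
clf.fit(X_tr_pca, y_tr)
acc_pca = accuracy_score(y_te, clf.predict(X_te_pca))
print(f"Accuracy with 2 PCA components: {acc_pca:.4f}")
```

If the autoencoder's features do not beat this baseline, the nonlinear encoder may not be adding much for this dataset at this bottleneck size.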
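In this hands-on section, we generated and scaled a synthetic tabular dataset, built and trained a fully-connected autoencoder in PyTorch, used its encoder to compress 20 features into a 2-dimensional latent representation, visualized that latent space, and compared the extracted features with the original ones on a downstream classification task.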
This practical exercise provides a foundational workflow for using autoencoders for feature extraction. You can adapt these steps to your own tabular datasets. Experimenting with the number of layers, neurons per layer, latent space dimensionality, and other hyperparameters (discussed earlier in this chapter and explored further in Chapter 7) is key to achieving good results for your specific problem.
The next chapters will explore more advanced autoencoder architectures like Denoising Autoencoders, Convolutional Autoencoders for image data, and Variational Autoencoders, further expanding your toolkit for powerful feature extraction.
© 2025 ApX Machine Learning