Let's put the concepts from this chapter into practice. We'll build a complete data pipeline, starting from raw data (which we'll synthesize for simplicity) and ending with batches ready to be fed into a model. This involves creating a custom Dataset, defining data transformations, and wrapping everything in a DataLoader.
Imagine we have a dataset consisting of feature vectors and corresponding binary classification labels (0 or 1). For this exercise, we'll generate this data directly using PyTorch tensors. This avoids file I/O complexities and lets us focus purely on the data handling mechanism.
import torch
import torch.utils.data as data
from torchvision import transforms
# Generate synthetic data
num_samples = 100
num_features = 10
# Create random feature vectors (e.g., sensor readings)
features = torch.randn(num_samples, num_features)
# Create random binary labels (0 or 1)
labels = torch.randint(0, 2, (num_samples,))
print(f"Shape of features: {features.shape}") # Output: torch.Size([100, 10])
print(f"Shape of labels: {labels.shape}") # Output: torch.Size([100])
print(f"First 5 features:\n{features[:5]}")
print(f"First 5 labels:\n{labels[:5]}")
This gives us two tensors: features containing 100 samples, each with 10 features, and labels containing the corresponding 100 labels.
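As a quick sanity check (an extra line, not part of the original walkthrough), you can confirm the random labels are roughly balanced between the two classes:

# Count how many samples fall in each class; exact counts vary run to run
# because the labels are random.
print(torch.bincount(labels))  # e.g. tensor([52, 48])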
Creating a Custom Dataset
Now we need to structure this data using PyTorch's Dataset class. We'll create a custom class that inherits from torch.utils.data.Dataset and implements two essential methods:

- __len__(self): returns the total number of samples in the dataset.
- __getitem__(self, idx): returns the sample (features and label) at the given index idx.

We'll also add an __init__ method to store our data and optionally accept transformations.
class SyntheticDataset(data.Dataset):
    """A custom Dataset for our synthetic features and labels."""

    def __init__(self, features, labels, transform=None):
        """
        Args:
            features (Tensor): Tensor containing the feature data.
            labels (Tensor): Tensor containing the labels.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        # Basic check to ensure features and labels have the same number of samples
        assert features.shape[0] == labels.shape[0], \
            "Features and labels must have the same number of samples"
        self.features = features
        self.labels = labels
        self.transform = transform

    def __len__(self):
        """Returns the total number of samples."""
        return self.features.shape[0]

    def __getitem__(self, idx):
        """
        Retrieves the feature vector and label for a given index.

        Args:
            idx (int): Index of the sample to retrieve.

        Returns:
            tuple: (feature, label) where feature is the feature vector
                and label is the corresponding label.
        """
        # Get the raw feature and label
        feature_sample = self.features[idx]
        label_sample = self.labels[idx]

        # Create a sample dictionary (or tuple)
        sample = {'feature': feature_sample, 'label': label_sample}

        # Apply transformations if they exist
        if self.transform:
            sample = self.transform(sample)

        # Return the potentially transformed sample
        # Common practice is to return features and labels separately
        return sample['feature'], sample['label']


# Instantiate the dataset without transforms for now
raw_dataset = SyntheticDataset(features, labels)

# Test retrieving a sample
sample_idx = 0
feature_sample, label_sample = raw_dataset[sample_idx]
print(f"\nSample {sample_idx} - Feature: {feature_sample}")
print(f"Sample {sample_idx} - Label: {label_sample}")
print(f"Dataset length: {len(raw_dataset)}")  # Output: 100
At this point, raw_dataset holds our data and knows how to provide individual samples.
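Because SyntheticDataset follows the standard Dataset interface, it also composes with other utilities in torch.utils.data. As a brief aside (a sketch, not part of this pipeline), random_split can carve out a validation subset:

# Split the 100 samples into 80 for training and 20 for validation.
# A fixed generator makes the split reproducible.
train_set, val_set = data.random_split(
    raw_dataset, [80, 20],
    generator=torch.Generator().manual_seed(42)
)
print(len(train_set), len(val_set))  # 80 20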
Often, raw data isn't suitable for direct input into a neural network. We might need to normalize features, convert data types, or apply augmentations (especially for images). torchvision.transforms provides convenient tools for this. Even though our data isn't images, we can define custom transformations or use existing ones that operate on tensors.
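For example, torchvision's transforms.Lambda wraps any callable so it can be composed with other transforms. Here is a minimal sketch (not used in the pipeline below) showing it applied to a plain tensor:

# A toy transform that scales a tensor; Lambda accepts any callable.
double = transforms.Lambda(lambda x: x * 2)
print(double(torch.ones(3)))  # tensor([2., 2., 2.])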
Let's define a simple transformation pipeline that will:

- Convert the features to torch.float32 (good practice for model inputs).
- Convert the labels to torch.long (often required by loss functions like CrossEntropyLoss).
- Normalize the features using the per-feature mean and standard deviation computed over the dataset.

Since torchvision.transforms are primarily designed for images (PIL Image or Tensor), applying them directly to a dictionary like our sample requires a bit of wrapping. We'll create custom callable classes or lambda functions for this.
# Calculate mean and std deviation for normalization (across the dataset)
feature_mean = features.mean(dim=0)
feature_std = features.std(dim=0)

# Avoid division by zero if std dev is zero for any feature
feature_std[feature_std == 0] = 1.0


# Define custom transform classes/functions for our dictionary sample format
class ToTensorAndType(object):
    """Converts features to FloatTensor and labels to LongTensor."""

    def __call__(self, sample):
        feature, label = sample['feature'], sample['label']
        return {'feature': feature.float(), 'label': label.long()}


class NormalizeFeatures(object):
    """Normalizes the feature tensor."""

    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def __call__(self, sample):
        feature, label = sample['feature'], sample['label']
        # Apply normalization: (tensor - mean) / std
        normalized_feature = (feature - self.mean) / self.std
        return {'feature': normalized_feature, 'label': label}


# Compose the transformations
data_transforms = transforms.Compose([
    ToTensorAndType(),
    NormalizeFeatures(mean=feature_mean, std=feature_std)
])
# Instantiate the dataset WITH the transformations
transformed_dataset = SyntheticDataset(features, labels, transform=data_transforms)
# Test retrieving a transformed sample
sample_idx = 0
transformed_feature, transformed_label = transformed_dataset[sample_idx]
print(f"\n--- Transformed Sample {sample_idx} ---")
print(f"Original Feature:\n{features[sample_idx]}")
print(f"Transformed Feature:\n{transformed_feature}")
print(f"Original Label: {labels[sample_idx]} (dtype={labels.dtype})")
print(f"Transformed Label: {transformed_label} (dtype={transformed_label.dtype})")
# Sanity-check the normalization. Each feature now has mean 0 and std 1 across
# the whole dataset, so the values of any single sample should hover around 0.
print(f"Transformed Feature Mean: {transformed_feature.mean():.4f}")
Notice how the feature values have changed due to normalization, and the data types of the feature and label are now torch.float32 and torch.int64 (LongTensor), respectively.
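If you want the pipeline to fail fast should these assumptions ever break, a small optional check (not in the original listing) can assert the expected dtypes:

# Fail fast if the transform pipeline stops producing the dtypes the model expects.
assert transformed_feature.dtype == torch.float32
assert transformed_label.dtype == torch.int64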
Creating the DataLoader
The final step is to use DataLoader. It takes our Dataset instance and handles batching, shuffling, and potentially parallel data loading.
# Create the DataLoader
batch_size = 16      # Process data in batches of 16 samples
shuffle_data = True  # Shuffle the data at the beginning of each epoch
num_workers = 0      # Number of subprocesses to use for data loading. 0 means data loading happens in the main process.

# On platforms other than Windows, you can often set num_workers > 0 for parallel loading
# import os
# if os.name != 'nt':  # Check if not Windows
#     num_workers = 2

data_loader = data.DataLoader(
    transformed_dataset,
    batch_size=batch_size,
    shuffle=shuffle_data,
    num_workers=num_workers
)
# Iterate through the DataLoader to get batches
print(f"\n--- Iterating through DataLoader (batch_size={batch_size}) ---")

# Get one batch
feature_batch, label_batch = next(iter(data_loader))

print(f"Type of feature_batch: {type(feature_batch)}")
print(f"Shape of feature_batch: {feature_batch.shape}")      # Output: torch.Size([16, 10])
print(f"Shape of label_batch: {label_batch.shape}")          # Output: torch.Size([16])
print(f"Data type of feature_batch: {feature_batch.dtype}")  # Output: torch.float32
print(f"Data type of label_batch: {label_batch.dtype}")      # Output: torch.int64

# You can loop through all batches like this (e.g., in a training epoch)
# print("\nLooping through a few batches:")
# for i, (batch_features, batch_labels) in enumerate(data_loader):
#     if i >= 3:  # Show first 3 batches
#         break
#     print(f"Batch {i+1}: Features shape={batch_features.shape}, Labels shape={batch_labels.shape}")
#     # In a real training loop, you would feed batch_features to your model here
The DataLoader yields batches where the first dimension corresponds to the batch_size. Our feature batch has shape [16, 10], and the label batch has shape [16]. The data types reflect the transformations we applied.
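One detail worth noting (the snippet below is an extra check, not part of the original listing): with 100 samples and a batch size of 16, the final batch of each epoch contains only the 4 leftover samples unless you pass drop_last=True to the DataLoader.

# 100 samples / 16 per batch -> 7 batches, the last holding only 4 samples.
print(f"Batches per epoch: {len(data_loader)}")  # 7
last_batch_features, _ = list(data_loader)[-1]
print(f"Last batch size: {last_batch_features.shape[0]}")  # 4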
To summarize the flow we just created: raw tensors are wrapped in a custom Dataset, transformations are applied during data retrieval (__getitem__), and a DataLoader produces shuffled batches suitable for model training.
You have now successfully built a data pipeline using PyTorch's core data utilities. You created a Dataset to wrap your data, applied the necessary transforms, and used a DataLoader to efficiently generate batches. This structured approach is fundamental for handling data in almost any PyTorch project, ensuring your models receive data in the correct format and facilitating efficient training. This pipeline is now ready to be integrated into the training loop we will construct in the next chapter.