Having explored the roles of torch.utils.data.Dataset and torch.utils.data.DataLoader, it's time to put this knowledge into practice. In this section, we'll walk through the process of creating a custom Dataset for a synthetic tabular dataset and then use a DataLoader to prepare it for model training. This hands-on exercise will solidify your understanding of how PyTorch handles data pipelines, providing a contrast to how you might achieve similar results with tf.data.
Let's imagine we have a simple binary classification task. Our data consists of numerical features and a corresponding binary label (0 or 1). We'll generate this data synthetically for simplicity.
The first step is to define a class that inherits from torch.utils.data.Dataset. At a minimum, such a class must implement __len__ and __getitem__; in practice, you'll also define __init__:

- __init__(self, ...): The constructor. This is where you'll typically load your data (e.g., from files, a database, or generate it on the fly). You can also perform any one-time preprocessing here.
- __len__(self): Returns the total number of samples in your dataset. The DataLoader uses this to know the extent of the dataset.
- __getitem__(self, idx): Retrieves a single sample (features and corresponding label) from your dataset at the given index idx. This is also where you'd typically apply transformations specific to an individual sample, such as converting data to PyTorch tensors.

Let's create our SyntheticTabularDataset:
import torch
from torch.utils.data import Dataset
import numpy as np

class SyntheticTabularDataset(Dataset):
    def __init__(self, num_samples=1000, num_features=10):
        """
        Constructor for the SyntheticTabularDataset.

        Args:
            num_samples (int): The total number of samples to generate.
            num_features (int): The number of features for each sample.
        """
        super().__init__()  # Good practice to call parent constructor
        self.num_samples = num_samples
        self.num_features = num_features

        # Generate synthetic features (random numbers from a normal distribution).
        # In a real scenario, you would load your features from a file or other source.
        self.features = np.random.randn(num_samples, num_features).astype(np.float32)

        # Generate synthetic labels (random binary labels: 0 or 1)
        self.labels = np.random.randint(0, 2, num_samples).astype(np.int64)

        # For TensorFlow users: at this stage, you might have your data in NumPy arrays.
        # You might then use tf.data.Dataset.from_tensor_slices((self.features, self.labels))
        # to create a TensorFlow Dataset. Here, we're defining a PyTorch Dataset class.

    def __len__(self):
        """
        Returns the total number of samples in the dataset.
        """
        return self.num_samples

    def __getitem__(self, idx):
        """
        Retrieves the sample (features and label) at the given index.

        Args:
            idx (int): The index of the sample to retrieve.

        Returns:
            tuple: (features, label) where features is a PyTorch tensor
                   and label is a PyTorch tensor.
        """
        # Fetch the specific sample
        sample_features = self.features[idx]
        sample_label = self.labels[idx]

        # Convert NumPy arrays to PyTorch tensors.
        # This is a common operation done in __getitem__.
        # In TensorFlow, tf.data handles tensor conversions implicitly
        # when data is passed to from_tensor_slices or through map transformations.
        return torch.from_numpy(sample_features), torch.tensor(sample_label)
In the __init__ method, we generate our features and labels using NumPy and store them as instance attributes. np.float32 is a common data type for features, and np.int64 is typical for classification labels (especially when using loss functions like CrossEntropyLoss).
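As a quick illustration of why the int64 dtype matters (the tiny logits tensor and criterion below are illustrative assumptions, not part of the dataset code): nn.CrossEntropyLoss expects floating-point logits and integer class-index targets of dtype torch.int64 (torch.long).

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 2)            # float32 model outputs for a batch of 4 samples, 2 classes
targets = torch.tensor([0, 1, 1, 0])  # int64 class indices, the dtype our Dataset produces
loss = criterion(logits, targets)     # class-index targets must be integer (torch.long)
print(loss.item())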
The __len__ method is straightforward; it simply returns self.num_samples.
The __getitem__ method takes an index idx, retrieves the corresponding features and label from our NumPy arrays, and then converts them into PyTorch tensors. torch.from_numpy is used for array-to-tensor conversion, and torch.tensor is suitable for scalar labels. This conversion to tensors is important, as PyTorch models expect tensor inputs.
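One practical difference worth knowing (the small array below is just for illustration): torch.from_numpy shares memory with the source NumPy array, whereas torch.tensor makes a copy.

import numpy as np
import torch

arr = np.zeros(3, dtype=np.float32)
shared = torch.from_numpy(arr)  # shares memory with arr
copied = torch.tensor(arr)      # independent copy of the data

arr[0] = 5.0
print(shared)  # tensor([5., 0., 0.]) - reflects the in-place change to arr
print(copied)  # tensor([0., 0., 0.]) - unaffected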
Now that we've defined our SyntheticTabularDataset, let's create an instance of it and see how to access individual samples:
# Create an instance of our custom dataset
dataset = SyntheticTabularDataset(num_samples=50, num_features=3)
# Check the length of the dataset
print(f"Dataset length: {len(dataset)}")
# Get a single sample (e.g., the first sample at index 0)
features_sample, label_sample = dataset[0]
print(f"\nFirst sample:")
print(f" Features: {features_sample}")
print(f" Label: {label_sample}")
print(f" Features shape: {features_sample.shape}, dtype: {features_sample.dtype}")
print(f" Label shape: {label_sample.shape}, dtype: {label_sample.dtype}")
# Get another sample
features_sample_2, label_sample_2 = dataset[10]
print(f"\nSample at index 10:")
print(f" Features: {features_sample_2}")
print(f" Label: {label_sample_2}")
Running this code will show you the total number of samples and the structure of individual samples retrieved from the dataset, already conveniently converted to PyTorch tensors.
While accessing individual samples is useful for inspection, for training a model we need to process data in batches, shuffle it, and potentially load it in parallel. This is where torch.utils.data.DataLoader comes in.

The DataLoader takes a Dataset object as input and provides an iterable over it. Key parameters include:
- dataset: The Dataset object from which to load the data.
- batch_size: The number of samples per batch.
- shuffle: If True, the data is reshuffled at every epoch. This is generally recommended for training data to prevent the model from learning the order of samples.
- num_workers: The number of subprocesses to use for data loading. Setting num_workers > 0 enables multi-process data loading, which can significantly speed up data fetching, especially if __getitem__ involves I/O operations or significant computation. For simple, in-memory datasets like our synthetic example, num_workers=0 (the default, loading in the main process) is often fine; a brief sketch of the multi-process setup follows this list.
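As a hedged sketch of what enabling worker processes can look like (the num_workers=2 and pin_memory=True settings are illustrative choices, not requirements): on platforms that start subprocesses with the "spawn" method (Windows, macOS), the code that creates and iterates the DataLoader should sit under an if __name__ == "__main__": guard so the workers can safely re-import your script.

from torch.utils.data import DataLoader

if __name__ == "__main__":
    ds = SyntheticTabularDataset(num_samples=1000, num_features=5)
    loader = DataLoader(ds,
                        batch_size=32,
                        shuffle=True,
                        num_workers=2,    # two worker subprocesses fetch samples in parallel
                        pin_memory=True)  # can speed up later CPU-to-GPU transfers
    for features, labels in loader:
        pass  # consume batches as usual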
Let's create a DataLoader for our SyntheticTabularDataset:
from torch.utils.data import DataLoader

# Re-instantiate the dataset, perhaps with more samples for a typical training scenario
train_dataset = SyntheticTabularDataset(num_samples=1000, num_features=5)

# Create a DataLoader
batch_size = 32

# For TensorFlow users: DataLoader combines the functionalities of
# tf.data.Dataset.shuffle() and tf.data.Dataset.batch().
train_dataloader = DataLoader(dataset=train_dataset,
                              batch_size=batch_size,
                              shuffle=True,
                              num_workers=0)  # Use 0 for simplicity here; try >0 for real tasks

# Iterate over the DataLoader to get batches
print(f"\nIterating through DataLoader (first 2 batches as an example):")
for i, (batch_features, batch_labels) in enumerate(train_dataloader):
    if i < 2:  # Print info for the first two batches
        print(f"  Batch {i+1}:")
        print(f"    Features batch shape: {batch_features.shape}")  # [batch_size, num_features]
        print(f"    Labels batch shape: {batch_labels.shape}")      # [batch_size]
        print(f"    Features batch dtype: {batch_features.dtype}")
        print(f"    Labels batch dtype: {batch_labels.dtype}")
    else:
        break  # Stop after showing two batches for brevity

# You can also iterate over it in a typical training loop:
# num_epochs = 3
# for epoch in range(num_epochs):
#     print(f"\nEpoch {epoch+1}/{num_epochs}")
#     for batch_idx, (features, labels) in enumerate(train_dataloader):
#         # In a real training loop:
#         # 1. Move data to device (e.g., GPU)
#         #    features, labels = features.to(device), labels.to(device)
#         # 2. Forward pass: model_output = model(features)
#         # 3. Calculate loss: loss = criterion(model_output, labels)
#         # 4. Backward pass: loss.backward()
#         # 5. Optimizer step: optimizer.step()
#         # 6. Zero gradients: optimizer.zero_grad()
#         if batch_idx % 10 == 0:  # Print progress every 10 batches
#             print(f"  Processed batch {batch_idx+1}/{len(train_dataloader)}")
#     print("-" * 30)
When you run this, you'll observe that batch_features is a tensor of shape (batch_size, num_features) and batch_labels is a tensor of shape (batch_size,). The DataLoader has efficiently grouped individual samples from your Dataset into these batches. If shuffle=True, the order of samples within these batches (and the order of batches themselves) will be different each time you iterate through train_dataloader (e.g., at the start of each training epoch).
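To make the commented training-loop sketch above concrete, here is a minimal runnable version; the linear model, loss, and optimizer below are illustrative choices for our synthetic 5-feature, 2-class data, not part of the data pipeline itself.

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(5, 2)  # maps 5 input features to 2 class logits
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

for epoch in range(2):  # two epochs are enough to see the pipeline in action
    for features, labels in train_dataloader:
        logits = model(features)          # forward pass on a batch of shape [batch_size, 5]
        loss = criterion(logits, labels)  # labels are int64, as produced by the Dataset
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1}: last batch loss = {loss.item():.4f}")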
In our SyntheticTabularDataset, we directly converted NumPy arrays to PyTorch tensors within the __getitem__ method. For more complex preprocessing or data augmentation (especially common with image data), PyTorch Dataset classes often accept a transform argument in their __init__ method. This transform is typically a callable (like a function or an object with a __call__ method) that is applied to the sample in __getitem__ before it's returned.
For example, if you were working with images, you might pass a series of transformations from torchvision.transforms (like resizing, cropping, normalization, and conversion to tensor) to your custom image dataset.
# Hypothetical example of how a transform might be used
# from PIL import Image
#
# class MyImageDataset(Dataset):
#     def __init__(self, image_paths, labels, transform=None):
#         self.image_paths = image_paths
#         self.labels = labels
#         self.transform = transform
#
#     def __len__(self):
#         return len(self.image_paths)
#
#     def __getitem__(self, idx):
#         image = Image.open(self.image_paths[idx])  # Load image
#         label = self.labels[idx]
#         if self.transform:
#             image = self.transform(image)  # Apply transformations
#         return image, label

# from torchvision import transforms
# image_transform = transforms.Compose([
#     transforms.Resize((256, 256)),
#     transforms.RandomCrop(224),
#     transforms.ToTensor(),
#     transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
# ])
# image_dataset = MyImageDataset(paths, labels, transform=image_transform)
This approach keeps the data loading logic clean and allows for flexible composition of preprocessing steps. For our tabular data, the direct tensor conversion in __getitem__ was sufficient, but it's good to be aware of this common pattern for more advanced use cases.
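To show how the same pattern could apply to our tabular data, here is a hypothetical sketch; the StandardizeTransform class and the statistics passed to it are illustrative, not part of the dataset defined earlier.

import torch

class StandardizeTransform:
    """Hypothetical callable transform: scale features to zero mean and unit variance."""
    def __init__(self, mean, std):
        self.mean = torch.as_tensor(mean, dtype=torch.float32)
        self.std = torch.as_tensor(std, dtype=torch.float32)

    def __call__(self, features):
        return (features - self.mean) / self.std

# A transform-aware tabular Dataset would store the callable in __init__
# (e.g., self.transform = transform) and apply it in __getitem__ before returning:
# if self.transform:
#     sample_features = self.transform(sample_features)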
This practical exercise demonstrates the fundamental PyTorch pattern for data handling: define how to get a single processed item with Dataset, and then use DataLoader to efficiently batch, shuffle, and iterate over these items. This separation of concerns offers considerable flexibility, similar to how you might define data sources and then apply transformations like .batch() and .shuffle() in TensorFlow's tf.data API, but with a more explicit Python class-based structure for the Dataset itself. You now have the building blocks to create efficient data pipelines for a wide variety of data types in PyTorch.