Many graph neural network applications require working with collections of graphs or large, standard benchmark datasets. Although a Data object effectively represents a single graph, manually creating these objects for each instance in a collection would be tedious and error-prone. PyTorch Geometric simplifies this process with its powerful torch_geometric.datasets module, which provides easy access to a wide variety of common graph datasets.
This module automates the downloading, processing, and formatting of data, allowing you to load an entire dataset with just a single line of code. This frees you to concentrate on the model architecture rather than the complexities of data preparation.
PyTorch Geometric comes pre-packaged with dozens of benchmark datasets. These range from citation networks such as Cora and PubMed to social networks, bioinformatics graphs, and 3D mesh datasets such as ModelNet10.
Let's start by loading a dataset suited for graph classification, where the task is to predict a property for each entire graph. The TUDataset collection is a popular source for this, containing datasets like ENZYMES, PROTEINS, and IMDB-BINARY. We can load the ENZYMES dataset, which consists of 600 graphs representing protein structures.
from torch_geometric.datasets import TUDataset
# The 'root' directory is where the dataset will be stored.
# PyG will automatically download it if it's not found.
dataset = TUDataset(root='data/TUDataset', name='ENZYMES')
print(f'Dataset: {dataset}')
print('====================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_node_features}')
print(f'Number of classes: {dataset.num_classes}')
When you run this code for the first time, PyG automatically downloads the raw data into the data/TUDataset/ENZYMES/raw directory and then converts it into a processed format saved in data/TUDataset/ENZYMES/processed. Subsequent runs will load the processed data directly.
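For ENZYMES, the resulting directory layout looks roughly like the following (exact file names can vary slightly between PyG versions):
data/TUDataset/ENZYMES/
├── raw/
│   ├── ENZYMES_A.txt
│   ├── ENZYMES_graph_indicator.txt
│   ├── ENZYMES_graph_labels.txt
│   ├── ENZYMES_node_attributes.txt
│   └── ENZYMES_node_labels.txt
└── processed/
    ├── data.pt
    ├── pre_filter.pt
    └── pre_transform.pt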
The output shows some useful properties:
Dataset: ENZYMES(600)
====================
Number of graphs: 600
Number of features: 3
Number of classes: 6
A PyG Dataset object functions much like a standard Python list. You can get its length with len(), and you can access individual graphs using standard list indexing. Each element is a Data object, just like the one we explored in the previous section.
A PyG Dataset can be seen as a collection of individual Data objects, each representing a single graph.
Let's inspect the first graph in our ENZYMES dataset:
data = dataset[0]
print(data)
print('==============================================================')
# Get some stats about the first graph.
print(f'Number of nodes: {data.num_nodes}')
print(f'Number of edges: {data.num_edges}')
print(f'Is undirected: {data.is_undirected()}')
This will produce output similar to:
Data(x=[37, 3], edge_index=[2, 168], y=[1])
==============================================================
Number of nodes: 37
Number of edges: 168
Is undirected: True
This first graph has 37 nodes, each with 3 features. Its label, y, indicates which of the 6 enzyme classes it belongs to. Since Dataset objects are iterable, they integrate smoothly with PyG's data loaders for mini-batch training, which we will discuss later. You can also easily shuffle the dataset, a common first step before training.
# Shuffle the dataset (returns a shuffled copy; the original order is unchanged)
shuffled_dataset = dataset.shuffle()
print(shuffled_dataset[0])
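Because a Dataset also supports slicing, you can carve the shuffled collection into training and test portions with ordinary slice syntax. The 540/60 split below is just one illustrative choice:
# Use the first 540 graphs for training and the remaining 60 for testing.
train_dataset = shuffled_dataset[:540]
test_dataset = shuffled_dataset[540:]
print(f'Training graphs: {len(train_dataset)}')
print(f'Test graphs: {len(test_dataset)}')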
Graph classification involves many small graphs. In contrast, node classification typically involves a single, large graph. PyTorch Geometric provides a different class, Planetoid, for handling famous citation network datasets like Cora, CiteSeer, and PubMed.
When you load one of these, the Dataset object contains just one Data object representing the entire graph.
from torch_geometric.datasets import Planetoid
# Load the Cora dataset
dataset = Planetoid(root='data/Planetoid', name='Cora')
print(f'Dataset: {dataset}')
print('====================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_node_features}')
print(f'Number of classes: {dataset.num_classes}')
# Get the single Data object
data = dataset[0]
print(f'\nGraph object: {data}')
The output confirms that we have one large graph:
Dataset: Cora()
====================
Number of graphs: 1
Number of features: 1433
Number of classes: 7
Graph object: Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])
The Cora graph has 2,708 nodes (scientific papers) and 10,556 edges (citations). Each paper is represented by a 1,433-dimensional feature vector. The important difference here is the presence of train_mask, val_mask, and test_mask attributes. These are boolean tensors that specify which nodes should be used for training, validation, and testing, respectively. This is standard for the semi-supervised, transductive learning setting common in node classification benchmarks.
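You can check how many nodes each mask selects by summing the boolean tensors. For the standard public split of Cora, this reports 140 training, 500 validation, and 1,000 test nodes:
# Count how many nodes belong to each split.
print(f'Training nodes: {data.train_mask.sum().item()}')
print(f'Validation nodes: {data.val_mask.sum().item()}')
print(f'Test nodes: {data.test_mask.sum().item()}')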
In a transductive setting, the model has access to the entire graph's feature matrix and structure during training, but only the labels of the training nodes. The masks determine which nodes are used for calculating the loss (train_mask), tuning hyperparameters (val_mask), and final evaluation (test_mask).
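In practice, the masks are used as boolean indices into the model's full-graph output. The sketch below assumes model is some GNN that maps node features to per-class logits; the loss is computed only on the training nodes, while evaluation indexes the same output with test_mask:
import torch.nn.functional as F

# Forward pass over the entire graph (all features and edges are visible).
out = model(data.x, data.edge_index)

# The loss is computed only on the nodes selected by train_mask.
loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])

# Evaluation uses the same full-graph output, restricted to the test nodes.
pred = out.argmax(dim=1)
test_acc = (pred[data.test_mask] == data.y[data.test_mask]).float().mean()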
While the built-in datasets are excellent for research and learning, you will often need to work with your own data. PyTorch Geometric provides base classes to simplify this process. For smaller datasets that can fit entirely in memory, you can inherit from torch_geometric.data.InMemoryDataset.
To create your own InMemoryDataset, you need to implement four primary methods:
raw_file_names(): Returns a list of the raw file names that must be present in self.raw_dir for the download step to be skipped.
processed_file_names(): Returns the file name of the processed dataset. PyG will look for this file in self.processed_dir and skip the process step if it is found.
download(): Contains the logic for downloading your raw data into self.raw_dir.
process(): The core method where you read your raw data, create a list of Data objects, and save the final list to disk using torch.save(self.collate(data_list), self.processed_paths[0]).
This structured approach ensures that your custom datasets are reusable, shareable, and behave just like the built-in ones. It handles the behind-the-scenes logic of checking for processed data, so your preprocessing code only runs once.
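As a minimal sketch of this pattern, the class below builds two tiny placeholder graphs inside process(); in a real dataset you would parse them from the files listed by raw_file_names() instead. The class name and file names here are illustrative choices, not part of PyG:
import torch
from torch_geometric.data import Data, InMemoryDataset

class MyGraphDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super().__init__(root, transform, pre_transform)
        # Load the collated data produced by process().
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        # If these files already exist in self.raw_dir, download() is skipped.
        return ['my_graphs.csv']

    @property
    def processed_file_names(self):
        # If this file already exists in self.processed_dir, process() is skipped.
        return ['data.pt']

    def download(self):
        # Fetch the raw files into self.raw_dir (omitted in this sketch).
        pass

    def process(self):
        # Read the raw data and build a list of Data objects.
        # Two tiny placeholder graphs stand in for real parsing logic here.
        data_list = []
        for label in range(2):
            x = torch.randn(4, 3)                      # 4 nodes, 3 features each
            edge_index = torch.tensor([[0, 1, 2, 3],
                                       [1, 0, 3, 2]])  # COO connectivity
            y = torch.tensor([label])                  # graph-level label
            data_list.append(Data(x=x, edge_index=edge_index, y=y))

        # Collate the list into a single storage object and save it to disk.
        torch.save(self.collate(data_list), self.processed_paths[0])
Instantiating MyGraphDataset(root='data/MyGraphs') runs download() and process() the first time; afterwards the processed file is loaded directly, exactly like the built-in datasets. With your data properly loaded, you are now ready to define a GNN model to learn from it.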