A practical way to get hands-on with PyTorch Geometric is to build, train, and evaluate a Graph Neural Network for semi-supervised node classification. This exercise uses the Cora dataset, a standard benchmark in graph machine learning, and covers data handling, model definition, and training loops within the PyG framework.
The Cora dataset is a citation network. Each node in the graph represents a scientific publication, and a directed edge from node A to node B indicates that publication A cites publication B. Each paper is described by a binary word vector indicating the presence or absence of words from a fixed dictionary, which serves as the node features. The task is to classify each publication into one of seven predefined academic subjects.
PyTorch Geometric includes a collection of common benchmark datasets, including Cora, which can be loaded with just a single line of code. Let's start by importing the necessary libraries and loading the dataset.
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv
# Load the Cora dataset
dataset = Planetoid(root='/tmp/Cora', name='Cora')
data = dataset[0]
# Print information about the dataset
print(f'Dataset: {dataset}:')
print('======================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')
# Print information about the graph
print(f'\nGraph:')
print('------')
print(f'Number of nodes: {data.num_nodes}')
print(f'Number of edges: {data.num_edges}')
print(f'Training nodes: {data.train_mask.sum()}')
print(f'Validation nodes: {data.val_mask.sum()}')
print(f'Test nodes: {data.test_mask.sum()}')
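For reference, on the standard Planetoid split the printed summary should look approximately like this (the exact repr of the dataset object can vary across PyG versions; edges are stored in both directions, so each citation link contributes two entries to the edge count):

Dataset: Cora():
======================
Number of graphs: 1
Number of features: 1433
Number of classes: 7

Graph:
------
Number of nodes: 2708
Number of edges: 10556
Training nodes: 140
Validation nodes: 500
Test nodes: 1000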
Running this code downloads the dataset on first use and prints the summary shown above. Notice that the dataset object contains a single graph, which we access with dataset[0]. This data object holds all the graph information:
- data.x: The node feature matrix with shape [num_nodes, num_features]. For Cora, this is [2708, 1433].
- data.edge_index: The graph connectivity in coordinate (COO) format with shape [2, num_edges].
- data.y: The ground-truth label for each node.
- data.train_mask, data.val_mask, data.test_mask: Boolean masks that identify which nodes to use for training, validation, and testing. This predefined split is characteristic of semi-supervised tasks, where we train on a small fraction of labeled nodes.

Next, we define our GNN architecture. We will use a simple two-layer Graph Convolutional Network (GCN). The first GCNConv layer will map the input features to a lower-dimensional hidden representation (e.g., 16 dimensions). The second GCNConv layer will map the hidden representations to the final number of classes (7 for Cora).
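For context, each GCNConv layer implements the graph convolution of Kipf and Welling. Ignoring the optional bias term, a layer computes

$$X' = \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} X \Theta$$

where $\hat{A} = A + I$ is the adjacency matrix with added self-loops, $\hat{D}$ is its diagonal degree matrix, $X$ contains the node features, and $\Theta$ is the layer's learnable weight matrix. Each layer mixes a node's features with those of its immediate neighbors, so stacking two layers lets information flow across two-hop neighborhoods.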
We define our model as a class that inherits from torch.nn.Module, which is the standard PyTorch convention.
class GCN(torch.nn.Module):
    def __init__(self, num_features, num_classes):
        super().__init__()
        self.conv1 = GCNConv(num_features, 16)
        self.conv2 = GCNConv(16, num_classes)

    def forward(self, x, edge_index):
        # First GCN layer
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)
        # Second GCN layer
        x = self.conv2(x, edge_index)
        return x

model = GCN(dataset.num_features, dataset.num_classes)
print(model)
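Printing the model gives a quick summary of its layers; with the dimensions used above, the output should look roughly like this:

GCN(
  (conv1): GCNConv(1433, 16)
  (conv2): GCNConv(16, 7)
)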
The forward method specifies the computation at each call. It applies the first convolution, followed by a ReLU activation function and a dropout layer for regularization. The output of the first stage is then passed to the second convolutional layer. The final output provides the raw logits for each of the 7 classes for every node in the graph.
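As an optional sanity check, you can run a single forward pass on the untrained model (both the model and the data are still on the CPU at this point) and confirm that it produces one logit vector per node:

# Optional sanity check: one forward pass on the untrained model
with torch.no_grad():
    out = model(data.x, data.edge_index)
print(out.shape)  # expected: torch.Size([2708, 7])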
With the data and model ready, we can write the training loop. We will use the Adam optimizer and the cross-entropy loss function. A key aspect of training on this dataset is that we compute the loss only on the training nodes, as indicated by data.train_mask. The model sees the entire graph structure (the full edge_index) during training but uses only the labels of the training nodes to calculate gradients.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GCN(dataset.num_features, dataset.num_classes).to(device)
data = data.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()
def train():
    model.train()
    optimizer.zero_grad()
    # Perform a single forward pass
    out = model(data.x, data.edge_index)
    # Compute the loss solely on the training nodes
    loss = criterion(out[data.train_mask], data.y[data.train_mask])
    # Derive gradients
    loss.backward()
    # Update parameters
    optimizer.step()
    return loss

for epoch in range(1, 201):
    loss = train()
    if epoch % 20 == 0:
        print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')
The training function encapsulates the standard PyTorch training steps. We set the model to training mode, clear gradients, perform a forward pass, compute the loss on the training set, backpropagate, and update the model's weights. We run this process for 200 epochs, printing the loss every 20 epochs to monitor progress.
After training, we need to evaluate the model's performance on the test set. We create a separate function for evaluation. This function sets the model to evaluation mode (model.eval()) to disable layers like dropout. It then calculates the model's predictions for all nodes and compares them against the true labels for the nodes in the test set.
@torch.no_grad()
def test():
    model.eval()
    # Forward pass on the entire graph
    out = model(data.x, data.edge_index)
    # Get the predicted class index
    pred = out.argmax(dim=1)
    # Calculate accuracy on the test set
    test_correct = pred[data.test_mask] == data.y[data.test_mask]
    test_acc = int(test_correct.sum()) / int(data.test_mask.sum())
    return test_acc

# Evaluate the trained model and print the final test accuracy
final_accuracy = test()
print(f'Final Test Accuracy: {final_accuracy:.4f}')
After 200 epochs of training, you should see a test accuracy of around 80-82%. This result is quite strong, especially considering that the model was trained on only 140 labeled nodes out of 2708. It demonstrates the power of GNNs in using graph structure to propagate label information from a few labeled nodes to many.
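Note that this minimal example never uses the validation split: data.val_mask is loaded with the dataset but ignored during training. A common extension, sketched below under the assumption that the GCN class, train() function, data, and device from above are available, is to retrain while tracking validation accuracy and keep the weights that perform best on it:

# Sketch: retrain with validation-based model selection (an extension, not part of the minimal workflow above)
import copy

model = GCN(dataset.num_features, dataset.num_classes).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

@torch.no_grad()
def accuracy(mask):
    model.eval()
    pred = model(data.x, data.edge_index).argmax(dim=1)
    return int((pred[mask] == data.y[mask]).sum()) / int(mask.sum())

best_val_acc = 0.0
best_state = copy.deepcopy(model.state_dict())
for epoch in range(1, 201):
    train()
    val_acc = accuracy(data.val_mask)
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_state = copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)  # keep the weights that did best on validation
print(f'Best validation accuracy: {best_val_acc:.4f}')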
A GNN learns to create powerful embeddings for each node that capture both its features and its local graph neighborhood. We can visualize these embeddings to get an intuition for what the model has learned. The output of our first GCNConv layer is a 16-dimensional embedding for each node. We can use a dimensionality reduction technique like t-SNE to project these embeddings into a 2D space and plot them. A well-trained model should produce embeddings where nodes of the same class form distinct clusters.
Let's extract the embeddings from our trained model and visualize them.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Get the 16-dimensional node embeddings
model.eval()
with torch.no_grad():
    h = model.conv1(data.x, data.edge_index)
    h = F.relu(h)
# Apply t-SNE (using the default iteration count; the n_iter argument was renamed and later removed in newer scikit-learn releases)
tsne = TSNE(n_components=2, perplexity=30)
tsne_results = tsne.fit_transform(h.cpu().numpy())
# Plot the results
plt.figure(figsize=(10, 8))
scatter = plt.scatter(tsne_results[:, 0], tsne_results[:, 1], c=data.y.cpu().numpy(), cmap='jet', alpha=0.7)
# Planetoid does not store class names, so label the legend with class indices
plt.legend(*scatter.legend_elements(), title='Class')
plt.title('t-SNE visualization of Cora node embeddings')
plt.xlabel('t-SNE feature 1')
plt.ylabel('t-SNE feature 2')
plt.show()
If you prefer not to run the code yourself, a pre-generated chart of what the output might look like is shown below.
The t-SNE plot shows the 16-dimensional node embeddings projected onto a 2D plane. Each point represents a scientific paper, colored by its subject category. The clear clustering of colors indicates that the GNN has learned to group papers with similar subjects together in the embedding space.
The visualization confirms our quantitative results. The model has successfully learned representations that separate the different classes of documents, which is exactly the goal of representation learning in this context. This complete workflow, from loading data to training a model and inspecting its outputs, serves as a template for tackling other graph-based machine learning problems with PyTorch Geometric.