Many real-world datasets are inherently relational, best represented as graphs. Examples include social networks, molecular structures, citation networks, knowledge graphs, and recommendation systems. Traditional deep learning architectures like CNNs and RNNs assume grid-like or sequential data structures, making them less suitable for the arbitrary connections found in graphs. Graph Neural Networks (GNNs) are specifically designed to operate directly on graph-structured data, learning representations that incorporate both node features and the graph's topology.
PyTorch Geometric (PyG) is a powerful and widely adopted library built upon PyTorch for developing and applying GNNs. It provides optimized implementations of various GNN layers, efficient data handling for graphs, and common graph benchmark datasets. This section will guide you through using PyG to implement and understand different GNN architectures.
Before building GNN models, we need a standardized way to represent graph data. PyG uses the torch_geometric.data.Data object. A Data object holds various attributes describing a single graph:

x: Node feature matrix with shape [num_nodes, num_node_features]. Each row represents a node, and its columns hold that node's features.
edge_index: Graph connectivity in COO (coordinate) format with shape [2, num_edges]. It stores the source and target node indices for each edge. For an edge from node j to node i, you'd have [j, i] as a column. This representation is efficient for sparse graphs.
edge_attr: Edge feature matrix with shape [num_edges, num_edge_features]. Optional features associated with each edge.
y: Target labels or values, depending on the task. For node-level tasks the shape is [num_nodes, ...]; for graph-level tasks it is [1, ...].
pos: Node positions with shape [num_nodes, num_dimensions]. Often used in geometric deep learning.

Here's how you might create a simple Data object:
import torch
from torch_geometric.data import Data
# Node features: 3 nodes, 2 features each
x = torch.tensor([[1, 2], [3, 4], [5, 6]], dtype=torch.float)
# Edges: (0 -> 1), (1 -> 0), (1 -> 2), (2 -> 1)
# Represented as source nodes and target nodes
edge_index = torch.tensor([[0, 1, 1, 2], # Source nodes
[1, 0, 2, 1]], # Target nodes
dtype=torch.long)
# Optional edge features: 4 edges, 1 feature each
edge_attr = torch.tensor([[0.5], [0.5], [0.8], [0.8]], dtype=torch.float)
# Optional node labels (e.g., for node classification)
y = torch.tensor([0, 1, 0], dtype=torch.long)
# Create the Data object
graph_data = Data(x=x, edge_index=edge_index, edge_attr=edge_attr, y=y)
print(graph_data)
# Output: Data(x=[3, 2], edge_index=[2, 4], edge_attr=[4, 1], y=[3])
PyG also provides torch_geometric.data.Dataset and torch_geometric.loader.DataLoader for handling collections of graphs and creating mini-batches efficiently. The DataLoader automatically handles the collation of graphs of varying sizes into larger batch objects.
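To illustrate, here is a minimal batching sketch that reuses the graph_data object created above; the three repeated copies simply stand in for a real dataset of graphs.

from torch_geometric.loader import DataLoader

# Batch several Data objects; each batch behaves like one large, disconnected graph.
loader = DataLoader([graph_data, graph_data, graph_data], batch_size=2, shuffle=True)
for batch in loader:
    # batch.batch is a vector mapping every node to the graph it came from.
    print(batch.num_graphs, batch.x.shape, batch.batch)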
Most GNN layers operate based on the message passing principle. The core idea is that each node iteratively updates its feature representation (embedding) by aggregating information from its local neighborhood. This process typically involves three steps for each node $i$ at layer $l$:

Message Computation: Each neighboring node $j \in \mathcal{N}(i)$ computes a message $m_{j \to i}^{(l)}$ based on its own features $h_j^{(l-1)}$ and, potentially, the features of the target node $h_i^{(l-1)}$ and the edge features $e_{j,i}$:

$$m_{j \to i}^{(l)} = \phi^{(l)}\left(h_i^{(l-1)}, h_j^{(l-1)}, e_{j,i}\right)$$

where $\phi^{(l)}$ is a differentiable message function (e.g., a neural network).

Aggregation: Node $i$ aggregates all incoming messages from its neighbors using a permutation-invariant function $\bigoplus$ (such as sum, mean, or max):

$$a_i^{(l)} = \bigoplus_{j \in \mathcal{N}(i)} m_{j \to i}^{(l)}$$

Update: Node $i$ updates its feature vector $h_i^{(l)}$ based on its previous representation $h_i^{(l-1)}$ and the aggregated message $a_i^{(l)}$:

$$h_i^{(l)} = \gamma^{(l)}\left(h_i^{(l-1)}, a_i^{(l)}\right)$$

where $\gamma^{(l)}$ is a differentiable update function (e.g., another neural network, or simply adding the aggregated message).

The initial features $h_i^{(0)}$ are typically the input node features data.x. Stacking multiple message passing layers allows information to propagate across larger distances in the graph.

Diagram illustrating the message passing concept for updating node $i$: information from neighbors $j_1, j_2, j_3$ (messages $m_1, m_2, m_3$) is aggregated and combined with the node's previous state $h_i^{(l-1)}$ to compute the new state $h_i^{(l)}$.

PyG provides optimized implementations of these steps within its layer classes.
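As a concrete illustration, the sketch below shows how the three steps map onto PyG's torch_geometric.nn.MessagePassing base class. The layer itself (a plain sum aggregation over linearly transformed neighbor features) is a hypothetical minimal example, not a standard architecture.

import torch
from torch_geometric.nn import MessagePassing

class SimpleSumLayer(MessagePassing):
    def __init__(self, in_channels, out_channels):
        super().__init__(aggr='add')  # Aggregation step: sum the incoming messages
        self.lin = torch.nn.Linear(in_channels, out_channels)

    def forward(self, x, edge_index):
        # propagate() calls message(), aggregates the results, then calls update()
        return self.propagate(edge_index, x=x)

    def message(self, x_j):
        # Message step: transform the source node features of each edge
        return self.lin(x_j)

    def update(self, aggr_out):
        # Update step: here the aggregated message is used directly as the new embedding
        return aggr_out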
PyG offers a wide variety of pre-implemented GNN layers. Let's look at three popular examples: GCN, GraphSAGE, and GAT.
GCN layers, introduced by Kipf & Welling (2017), perform a spectral-based graph convolution. The message passing update rule for a GCN layer can be written compactly as:

$$H^{(l+1)} = \sigma\left(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(l)} W^{(l)}\right)$$

Here, $H^{(l)}$ is the matrix of node embeddings at layer $l$, $W^{(l)}$ is a trainable weight matrix, $\sigma$ is an activation function (like ReLU), $\hat{A} = A + I$ is the adjacency matrix with added self-loops, and $\hat{D}$ is the diagonal degree matrix of $\hat{A}$. The term $\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}$ is a symmetric normalization of the adjacency matrix. Conceptually, this layer averages the features of neighboring nodes (including the node itself) and then applies a linear transformation followed by a non-linearity.
In PyG, you use torch_geometric.nn.GCNConv:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class SimpleGCN(torch.nn.Module):
    def __init__(self, num_node_features, num_classes, hidden_channels):
        super().__init__()
        self.conv1 = GCNConv(num_node_features, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)  # Dropout is often used between GNN layers
        x = self.conv2(x, edge_index)
        # For node classification, log-softmax outputs pair with a negative log-likelihood loss
        return F.log_softmax(x, dim=1)
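For instance, the model can be applied directly to the toy graph_data object created earlier (2 input features and 2 classes in that hypothetical example):

model = SimpleGCN(num_node_features=2, num_classes=2, hidden_channels=16)
out = model(graph_data)
print(out.shape)  # torch.Size([3, 2]): one log-probability vector per node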
GraphSAGE (Hamilton et al., 2017) focuses on learning aggregation functions rather than fixed convolutions. It's designed to be inductive, meaning it can generalize to unseen nodes during inference. GraphSAGE samples a fixed-size neighborhood for each node and then aggregates neighbor features using functions like mean, max, or LSTM pooling.
The core steps are: sample a fixed-size set of neighbors for each node, aggregate the neighbors' features with the chosen aggregator, and combine the aggregated vector with the node's own representation (typically by concatenation followed by a learned linear transformation and non-linearity).
PyG implements this with torch_geometric.nn.SAGEConv:
from torch_geometric.nn import SAGEConv

class SimpleGraphSAGE(torch.nn.Module):
    def __init__(self, num_node_features, num_classes, hidden_channels):
        super().__init__()
        # Default aggregator is 'mean'
        self.conv1 = SAGEConv(num_node_features, hidden_channels)
        self.conv2 = SAGEConv(hidden_channels, num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)
You can specify the aggregator type (e.g., aggr='max', aggr='mean') when creating the SAGEConv layer.
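For example, a SAGEConv layer with a max aggregator (the channel sizes here are purely illustrative):

conv = SAGEConv(in_channels=16, out_channels=32, aggr='max')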
GAT layers (Veličković et al., 2018) incorporate attention mechanisms, allowing nodes to assign different importance weights to their neighbors during aggregation. This makes the aggregation process more flexible and often leads to better performance.
Attention coefficients $e_{ij}$ between node $i$ and neighbor $j$ are calculated from their features, typically using a shared linear transformation and an attention mechanism (e.g., a single-layer feedforward network):

$$e_{ij} = \text{attention}\left(W^{(l)} h_i^{(l-1)}, W^{(l)} h_j^{(l-1)}\right)$$

These coefficients are then normalized with the softmax function across all neighbors of $i$:

$$\alpha_{ij} = \text{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}$$

The aggregated message is a weighted sum of transformed neighbor features:

$$a_i^{(l)} = \sum_{j \in \mathcal{N}(i)} \alpha_{ij} W^{(l)} h_j^{(l-1)}$$

The update step then applies a non-linearity, either directly to the aggregated message or after concatenating it with the previous state:

$$h_i^{(l)} = \sigma\left(a_i^{(l)}\right) \quad \text{or} \quad h_i^{(l)} = \sigma\left(\text{CONCAT}\left(h_i^{(l-1)}, a_i^{(l)}\right)\right)$$

GAT often employs multi-head attention, where multiple independent attention mechanisms are computed and their results are concatenated or averaged.
PyG implements this with torch_geometric.nn.GATConv:
from torch_geometric.nn import GATConv

class SimpleGAT(torch.nn.Module):
    def __init__(self, num_node_features, num_classes, hidden_channels, heads=8):
        super().__init__()
        # Use multi-head attention in the first layer
        self.conv1 = GATConv(num_node_features, hidden_channels, heads=heads, dropout=0.6)
        # The output width of multi-head attention is heads * hidden_channels.
        # For the last layer, the heads are often averaged or a single head is used.
        self.conv2 = GATConv(hidden_channels * heads, num_classes, heads=1, concat=False, dropout=0.6)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = F.dropout(x, p=0.6, training=self.training)  # Dropout on input features
        x = self.conv1(x, edge_index)
        x = F.elu(x)  # ELU activation is common in GAT
        x = F.dropout(x, p=0.6, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)
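The reason conv2 takes hidden_channels * heads input features is that GATConv concatenates the heads by default (concat=True). A quick, purely illustrative shape check (the sizes below are hypothetical):

import torch
from torch_geometric.nn import GATConv

# With concat=True (the default), the output width is heads * out_channels
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]], dtype=torch.long)
conv = GATConv(in_channels=16, out_channels=8, heads=4)
out = conv(torch.randn(3, 16), edge_index)
print(out.shape)  # torch.Size([3, 32])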
Constructing a GNN in PyTorch using PyG layers follows standard PyTorch practices. You define a class inheriting from torch.nn.Module, initialize the PyG layers in __init__, and define the forward pass logic in forward. The forward method typically takes a Data or Batch object as input and extracts x, edge_index, and potentially edge_attr and batch indices.
Training loops are also similar to standard PyTorch loops: iterate through the DataLoader, perform the forward pass, calculate the loss (e.g., F.nll_loss for node classification with log_softmax), compute gradients with loss.backward(), and update parameters using an optimizer.
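As an illustration, here is a minimal training loop sketch for node classification on the toy graph_data object defined earlier; in practice you would use a benchmark dataset and train/validation masks, and the hyperparameters below are just placeholders.

model = SimpleGCN(num_node_features=2, num_classes=2, hidden_channels=16)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

model.train()
for epoch in range(100):
    optimizer.zero_grad()
    out = model(graph_data)                # log-probabilities, shape [num_nodes, num_classes]
    loss = F.nll_loss(out, graph_data.y)   # pairs with the log_softmax output
    loss.backward()
    optimizer.step()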
GNNs are versatile and applied to various graph-related tasks:

Node classification: predicting a label for each node. The models defined above (SimpleGCN, SimpleGraphSAGE, SimpleGAT) are structured for node classification.
Graph classification: predicting a label for an entire graph. This requires a pooling (readout) step (e.g., torch_geometric.nn.global_mean_pool, global_max_pool) after the GNN layers to aggregate node embeddings into a single graph embedding, as sketched below.
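A minimal sketch of such a graph-level readout might look as follows; the SimpleGraphClassifier name and layer sizes are hypothetical, and data.batch is the graph-assignment vector produced by the DataLoader.

from torch_geometric.nn import global_mean_pool

class SimpleGraphClassifier(torch.nn.Module):
    def __init__(self, num_node_features, num_classes, hidden_channels):
        super().__init__()
        self.conv1 = GCNConv(num_node_features, hidden_channels)
        self.lin = torch.nn.Linear(hidden_channels, num_classes)

    def forward(self, data):
        x = F.relu(self.conv1(data.x, data.edge_index))
        x = global_mean_pool(x, data.batch)  # [num_nodes, H] -> [num_graphs, H]
        return F.log_softmax(self.lin(x), dim=1)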
PyTorch Geometric provides a comprehensive toolkit for tackling these tasks. By combining its optimized layers, data handling utilities, and standard PyTorch features, you can effectively build and train sophisticated GNN models for complex graph-based problems. Remember that choosing the right GNN architecture (GCN, GraphSAGE, GAT, or others) often depends on the specific characteristics of your graph data and the task at hand. Experimentation and understanding the underlying principles of each layer are important for successful application.