When you split data for traditional machine learning models, like a classifier for images, the process is straightforward. You randomly shuffle your dataset and divide it into training, validation, and test sets. This works because each data point, an image in this case, is assumed to be independent and identically distributed (i.i.d.). One image does not influence another.
In graph data, this assumption breaks down completely. Nodes are defined by their features and their connections to other nodes. The very essence of a GNN is to leverage these connections. If you randomly assign nodes to different sets, you risk creating a situation where the model is directly or indirectly trained on information from the test set, a problem known as data leakage. For example, a test node's features might be updated using information from a neighboring training node. This makes the evaluation unreliable and gives an overly optimistic measure of performance.
To handle this dependency, we use two distinct settings for splitting graph data: transductive and inductive.
In the transductive setting, we have access to the entire graph during training. This means all nodes and all edges are visible to the model from the start. However, we only have access to the labels for a subset of the nodes, which form our training set. The goal is to infer the labels for the remaining nodes in the same graph.
Think of it as filling in the blanks on a map you already have. You can see all the cities and roads, but only some cities are labeled with their population. Your task is to predict the population for the unlabeled cities.
During training, the GNN can pass messages across the entire graph structure. When we calculate the loss, however, we only consider the model's predictions for the nodes in the training set. The validation and test sets consist of other nodes within that same graph, and we use them to evaluate how well our model generalizes to its unseen neighbors.
How it works:
A `train_mask` indicates which nodes to use for loss calculation and backpropagation, while a `val_mask` and `test_mask` indicate which nodes to use for evaluation.

*A transductive split on a single graph. The model sees all nodes and edges. It trains on the labeled nodes (green) and is evaluated on its predictions for other nodes in the same graph (yellow for validation, red for test).*
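The mechanics can be sketched with a toy example in NumPy. The graph, features, and labels below are invented for illustration, and the single mean-aggregation step stands in for a full GNN layer; a real pipeline would use a library such as PyTorch Geometric:

```python
import numpy as np

# Toy graph: 4 nodes, adjacency with self-loops (illustrative values)
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
A_norm = A / A.sum(axis=1, keepdims=True)   # row-normalize for mean aggregation

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # node features (1-dim for simplicity)
y = np.array([1.0, 2.0, 3.0, 4.0])          # node labels (regression targets)

# Message passing runs over the ENTIRE graph: every node, labeled or not,
# contributes to its neighbors' representations.
H = A_norm @ X

# But the loss only looks at the nodes selected by train_mask.
train_mask = np.array([True, True, False, False])
test_mask = np.array([False, False, False, True])

train_loss = np.mean((H[train_mask, 0] - y[train_mask]) ** 2)
test_loss = np.mean((H[test_mask, 0] - y[test_mask]) ** 2)  # evaluation only
```

Note that `H` is computed for all four nodes in one pass; the masks only control which predictions contribute to the loss and which are reserved for evaluation.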
Many classic node classification benchmarks, such as the Cora citation network, are evaluated using this setting. The task is to classify academic papers (nodes) given the full citation network (edges).
In the inductive setting, the model is trained on one set of nodes or graphs and is then expected to make predictions on new, completely unseen nodes or graphs. This is much closer to the standard machine learning workflow and is necessary for most production applications where new data arrives continuously.
Imagine training a model to detect fraudulent transactions. You would train it on transaction graphs from past weeks. The goal is to deploy this model to detect fraud in next week's transaction graph, which will contain new customers and new interactions. The model cannot assume it has already seen the nodes it needs to make predictions on.
To achieve this, the training, validation, and test sets must be strictly separated. If you are splitting a single large graph, this means the validation and test nodes, along with all their connecting edges, must be completely removed from the graph seen by the model during training.
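For a single large graph, this separation can be sketched as plain edge-list filtering. The node IDs and edges below are illustrative:

```python
# Edge list for a small illustrative graph; nodes 4 and 5 are held out for testing.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (1, 4)]
test_nodes = {4, 5}

# The training graph must drop the held-out nodes AND every edge touching them;
# otherwise messages from test nodes leak into training representations.
train_edges = [(u, v) for (u, v) in edges
               if u not in test_nodes and v not in test_nodes]

print(train_edges)  # only edges among training nodes remain
```

During evaluation, the full graph (or the new graph) is restored so the model can use the test nodes' own neighborhoods to make predictions.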
How it works:
*An inductive split. The model is trained on a set of graphs (left) and evaluated on a new, unseen graph (right). The test graph's structure and nodes were not available during training.*
Architectures like GraphSAGE, which use neighborhood sampling, are particularly well-suited for inductive learning because they are explicitly designed to generate embeddings for any node, regardless of whether it was seen during training.
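The sampling idea behind GraphSAGE can be sketched in a few lines of plain Python. The fixed fan-out of 2, the mean aggregator, and the unweighted combine step are illustrative simplifications of the learned operators in the real architecture:

```python
import random

# Adjacency as a lookup; works for any node we can query at inference time.
neighbors = {0: [1, 2, 3], 1: [0], 2: [0, 3], 3: [0, 2]}
features = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}

def sage_embedding(node, fanout=2, seed=0):
    """One GraphSAGE-style layer: sample a fixed number of neighbors,
    then combine the node's own feature with the mean of the samples."""
    rng = random.Random(seed)
    nbrs = neighbors[node]
    sampled = nbrs if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
    agg = sum(features[n] for n in sampled) / len(sampled)
    return (features[node] + agg) / 2  # simple combine step (no learned weights)
```

Because the embedding is computed from a node's local neighborhood at query time rather than from a fixed per-node table, the same function applies unchanged to nodes that were never part of the training graph.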
| Aspect | Transductive Learning | Inductive Learning |
|---|---|---|
| Graph Access | The model sees the entire graph structure during training. | The model only sees the training graph(s). |
| Task Goal | Infer labels for unlabeled nodes within a fixed graph. | Generalize to make predictions on entirely new graphs or nodes. |
| Evaluation Data | Unlabeled nodes from the same graph used for training. | Completely new nodes or graphs unseen during training. |
| Common Use Case | Semi-supervised node classification on a single network. | Fraud detection, molecular property prediction, product recommendations. |
| Example Models | GCN (original formulation) | GraphSAGE, GAT (can be used in both settings) |
The choice between a transductive and an inductive setup depends entirely on your problem. If every node you will ever need to predict on already exists in a single, fixed graph, a transductive setup is appropriate. If the model must handle new nodes or entirely new graphs after deployment, you need an inductive setup.
When implementing your training pipeline, be mindful of this distinction. Using node masks for splitting is a sign of a transductive setup. Creating truly separate graph objects for your train, validation, and test sets is necessary for an inductive setup. Misinterpreting the setting can lead to data leakage and a model that fails when deployed.
© 2026 ApX Machine Learning