During training, a Graph Neural Network (GNN) relies on its loss function to guide the adjustment of its weights toward an effective configuration. However, a low loss value on the training set does not always translate into good performance on unseen data. To understand how well a model generalizes, we evaluate it on a held-out test set using metrics that are more interpretable than raw loss values. For node classification, these metrics answer a simple question: how many nodes did the model label correctly?
A GNN for node classification typically ends with a final linear layer that produces raw, unnormalized scores for each class, often called logits. If your model has C classes and you are evaluating N nodes, the output will be a tensor of shape [N,C]. To turn these scores into a definite prediction, we select the class with the highest score for each node. This is done using an argmax operation along the class dimension.
For example, if the output for a single node is [0.1, 2.5, -1.3], the argmax is 1, meaning the model predicts the second class (since indexing is zero-based). This process converts the model's continuous output scores into discrete class labels that we can compare against the ground-truth labels.
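In PyTorch, this conversion is a single call. A minimal sketch with hypothetical logits (the tensor values below are made up for illustration):

import torch

# Hypothetical logits for 4 nodes and 3 classes, shape [N, C] = [4, 3].
logits = torch.tensor([
    [0.1,  2.5, -1.3],
    [1.7, -0.4,  0.2],
    [-0.8, 0.3,  2.1],
    [0.5,  0.4,  0.6],
])

# argmax along the class dimension turns the scores into discrete labels.
preds = logits.argmax(dim=1)
print(preds)  # tensor([1, 0, 2, 2])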
Before we can calculate any metrics, we must first compare our model's predictions to the true labels. The most fundamental tool for organizing this comparison is the confusion matrix. In a binary classification task, the confusion matrix is a 2x2 table that summarizes the four possible outcomes for a prediction.
The four outcomes in a binary confusion matrix. Correct predictions (TP, TN) are on the main diagonal.
For node classification, which is typically a multi-class problem, this extends to a C×C matrix where C is the number of classes. The entry at row i and column j is the count of nodes that truly belong to class i but were predicted to be class j. The diagonal elements represent all the correctly classified nodes.
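Scikit-learn can build this matrix directly from label arrays. A small sketch with hypothetical labels for a three-class problem:

from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth and predicted labels for 8 nodes, 3 classes.
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 0, 2, 1]

# Entry (i, j) counts nodes of true class i predicted as class j.
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1 0]
#  [0 3 0]
#  [1 0 2]]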
Accuracy is the most intuitive metric. It measures the proportion of total predictions that were correct.
$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} = \frac{TP + TN}{TP + TN + FP + FN}$$

While simple to understand, accuracy can be misleading, especially on datasets with class imbalance. Imagine a graph where 95% of nodes belong to Class A and 5% belong to Class B. A lazy model that always predicts Class A will achieve 95% accuracy without learning anything useful about Class B. In such cases, we need more discerning metrics.
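In code, test-set accuracy is a direct comparison between predictions and labels. A minimal sketch, assuming a PyTorch Geometric-style data object with y labels and a boolean test_mask, and out holding the model's [N, C] logits (these variable names are assumptions for illustration):

# Assumed: `out` is the model's [N, C] logit tensor; `data` has `y` and `test_mask`.
pred = out.argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
accuracy = int(correct) / int(data.test_mask.sum())
print(f"Test accuracy: {accuracy:.4f}")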
To get a better picture of performance on imbalanced data, we turn to precision, recall, and the F1 score. These are typically calculated on a per-class basis. For a given class, we consider it the "positive" class and all other classes as "negative".
Precision answers the question: "Of all the nodes the model labeled as Class A, how many actually were Class A?" It measures the reliability of the model's positive predictions.
$$\text{Precision} = \frac{TP}{TP + FP}$$

High precision is important when the cost of a false positive is high. For example, in a system that automatically flags scientific papers as "retracted," you want high precision to avoid incorrectly flagging legitimate papers.
Recall (also known as sensitivity or true positive rate) answers the question: "Of all the nodes that truly are Class A, how many did the model find?" It measures the model's ability to identify all relevant instances.
$$\text{Recall} = \frac{TP}{TP + FN}$$

High recall is important when the cost of a false negative is high. For example, in a GNN that predicts which proteins are associated with a disease, you want high recall to avoid missing any potentially significant proteins.
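Both metrics are available in scikit-learn. A small sketch with hypothetical binary labels, treating class 1 as the positive class:

from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: TP = 3, FP = 1, FN = 1 for the positive class.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75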
Often, there is a trade-off between precision and recall. The F1 score provides a way to combine them into a single metric. It is the harmonic mean of precision and recall, which gives more weight to lower values, so the F1 score is high only if both precision and recall are high. For example, a precision of 0.9 paired with a recall of 0.1 yields an F1 score of only 0.18, even though their arithmetic mean is 0.5.
$$\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Since precision, recall, and F1 are calculated per class, we need a strategy to aggregate them into a single number for our multi-class node classification task. The two most common methods are macro and weighted averaging.
Macro Average: Calculate the metric independently for each class and then compute the unweighted average. This treats every class as equally important, regardless of how many nodes it contains. This is a good measure if you want to know how the model performs on all classes, including rare ones.
Weighted Average: Calculate the metric for each class, but when averaging, weight each class's score by its support (the number of true instances for that class). This accounts for class imbalance. A high weighted average F1 score indicates good performance on the most common classes.
On an imbalanced dataset, a high weighted average score can hide poor performance on minority classes. The lower macro average score reveals this weakness.
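The following sketch illustrates this with a hypothetical, heavily imbalanced label set in which a lazy model always predicts the majority class:

from sklearn.metrics import f1_score

# Hypothetical graph: 95% of nodes are class 0, 5% are class 1.
y_true = [0] * 19 + [1]
# A lazy model that always predicts the majority class.
y_pred = [0] * 20

print(f1_score(y_true, y_pred, average="weighted", zero_division=0))  # ~0.93, looks strong
print(f1_score(y_true, y_pred, average="macro", zero_division=0))     # ~0.49, exposes class 1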
The choice of evaluation metric depends entirely on your application's goals.
In practice, it is often useful to look at a full classification report, which shows precision, recall, and F1 score for each class individually. Libraries like scikit-learn provide convenient functions for this.
from sklearn.metrics import classification_report
# y_true: Ground truth labels (e.g., from data.test_mask)
# y_pred: Model's predicted labels on the test set
# Assuming class_names is a list of strings for your labels
print(classification_report(y_true, y_pred, target_names=class_names))
#               precision    recall  f1-score   support
#
#      Class 1       0.91      0.95      0.93       105
#      Class 2       0.75      0.82      0.78        80
#      Class 3       0.98      0.96      0.97       150
#
#     accuracy                           0.92       335
#    macro avg       0.88      0.91      0.89       335
# weighted avg       0.90      0.92      0.91       335
This detailed report gives you a comprehensive view of your model's strengths and weaknesses, allowing you to make informed decisions about how to improve it.