During training, a Graph Neural Network (GNN) relies on its loss function to guide the adjustment of its weights toward an effective configuration. However, a low loss value on the training set does not always translate into good performance on unseen data. To understand how well a model generalizes, we evaluate it on a held-out test set using metrics that are more interpretable than raw loss values. For node classification, these metrics answer a simple question: how many nodes did the model label correctly?
A GNN for node classification typically ends with a final linear layer that produces raw, unnormalized scores for each class, often called logits. If your model has C classes and you are evaluating N nodes, the output will be a tensor of shape [N,C]. To turn these scores into a definite prediction, we select the class with the highest score for each node. This is done using an argmax operation along the class dimension.
For example, if the output for a single node is [0.1, 2.5, -1.3], the argmax is 1, meaning the model predicts the second class (since indexing is zero-based). This process converts the model's continuous output scores into discrete class labels that we can compare against the ground-truth labels.
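In PyTorch, this conversion is a single call. A minimal sketch with hypothetical logits (the tensor values below are made up for illustration):

import torch

# Hypothetical logits for 4 nodes and 3 classes, shape [N, C] = [4, 3].
logits = torch.tensor([
    [0.1,  2.5, -1.3],
    [1.7, -0.4,  0.2],
    [-0.8, 0.3,  2.1],
    [0.5,  0.4,  0.6],
])

# argmax along the class dimension turns the scores into discrete labels.
preds = logits.argmax(dim=1)
print(preds)  # tensor([1, 0, 2, 2])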
Before we can calculate any metrics, we must first compare our model's predictions to the true labels. The most fundamental tool for organizing this comparison is the confusion matrix. In a binary classification task, the confusion matrix is a 2x2 table that summarizes the four possible outcomes for a prediction.
The four outcomes in a binary confusion matrix. Correct predictions (TP, TN) are on the main diagonal.
For node classification, which is typically a multi-class problem, this extends to a C×C matrix where C is the number of classes. The entry at row i and column j is the count of nodes that truly belong to class i but were predicted to be class j. The diagonal elements represent all the correctly classified nodes.
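Scikit-learn can build this matrix directly from label arrays. A small sketch with hypothetical labels for a three-class problem:

from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth and predicted labels for 8 nodes, 3 classes.
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 0, 2, 1]

# Entry (i, j) counts nodes of true class i predicted as class j.
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1 0]
#  [0 3 0]
#  [1 0 2]]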
Accuracy is the most intuitive metric. It measures the proportion of total predictions that were correct.
$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} = \frac{TP + TN}{TP + TN + FP + FN}$$

While simple to understand, accuracy can be misleading, especially on datasets with class imbalance. Imagine a graph where 95% of nodes belong to Class A and 5% belong to Class B. A lazy model that always predicts Class A will achieve 95% accuracy without learning anything useful about Class B. In such cases, we need more discerning metrics.
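In code, test-set accuracy is a direct comparison between predictions and labels. A minimal sketch, assuming a PyTorch Geometric-style data object with y labels and a boolean test_mask, and out holding the model's [N, C] logits (these variable names are assumptions for illustration):

# Assumed: `out` is the model's [N, C] logit tensor; `data` has `y` and `test_mask`.
pred = out.argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
accuracy = int(correct) / int(data.test_mask.sum())
print(f"Test accuracy: {accuracy:.4f}")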
To get a better picture of performance on imbalanced data, we turn to precision, recall, and the F1 score. These are typically calculated on a per-class basis. For a given class, we consider it the "positive" class and all other classes as "negative".
Precision answers the question: "Of all the nodes the model labeled as Class A, how many actually were Class A?" It measures the reliability of the model's positive predictions.
$$\text{Precision} = \frac{TP}{TP + FP}$$

High precision is important when the cost of a false positive is high. For example, in a system that automatically flags scientific papers as "retracted," you want high precision to avoid incorrectly flagging legitimate papers.
Recall (also known as sensitivity or true positive rate) answers the question: "Of all the nodes that truly are Class A, how many did the model find?" It measures the model's ability to identify all relevant instances.
$$\text{Recall} = \frac{TP}{TP + FN}$$

High recall is important when the cost of a false negative is high. For example, in a GNN that predicts which proteins are associated with a disease, you want high recall to avoid missing any potentially significant proteins.
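Both metrics are available in scikit-learn. A small sketch with hypothetical binary labels, treating class 1 as the positive class:

from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: TP = 3, FP = 1, FN = 1 for the positive class.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75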
Often, there is a trade-off between precision and recall. The F1 score provides a way to combine them into a single metric. It is the harmonic mean of precision and recall, which gives more weight to lower values, so the F1 score is high only if both precision and recall are high. For example, a precision of 0.9 paired with a recall of 0.1 yields an F1 score of only 0.18, even though their arithmetic mean is 0.5.
$$\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Since precision, recall, and F1 are calculated per class, we need a strategy to aggregate them into a single number for our multi-class node classification task. The two most common methods are macro and weighted averaging.
Macro Average: Calculate the metric independently for each class and then compute the unweighted average. This treats every class as equally important, regardless of how many nodes it contains. This is a good measure if you want to know how the model performs on all classes, including rare ones.
Weighted Average: Calculate the metric for each class, but when averaging, weight each class's score by its support (the number of true instances for that class). This accounts for class imbalance. A high weighted average F1 score indicates good performance on the most common classes.
On an imbalanced dataset, a high weighted average score can hide poor performance on minority classes. The lower macro average score reveals this weakness.
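The following sketch illustrates this with a hypothetical, heavily imbalanced label set in which a lazy model always predicts the majority class:

from sklearn.metrics import f1_score

# Hypothetical graph: 95% of nodes are class 0, 5% are class 1.
y_true = [0] * 19 + [1]
# A lazy model that always predicts the majority class.
y_pred = [0] * 20

print(f1_score(y_true, y_pred, average="weighted", zero_division=0))  # ~0.93, looks strong
print(f1_score(y_true, y_pred, average="macro", zero_division=0))     # ~0.49, exposes class 1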
The choice of evaluation metric depends entirely on your application's goals.
In practice, it is often useful to look at a full classification report, which shows precision, recall, and F1 score for each class individually. Libraries like scikit-learn provide convenient functions for this.
from sklearn.metrics import classification_report
# y_true: Ground truth labels (e.g., from data.test_mask)
# y_pred: Model's predicted labels on the test set
# Assuming class_names is a list of strings for your labels
print(classification_report(y_true, y_pred, target_names=class_names))
#               precision    recall  f1-score   support
#
#      Class 1       0.91      0.95      0.93       105
#      Class 2       0.75      0.82      0.78        80
#      Class 3       0.98      0.96      0.97       150
#
#     accuracy                           0.92       335
#    macro avg       0.88      0.91      0.89       335
# weighted avg       0.90      0.92      0.91       335
This detailed report gives you a comprehensive view of your model's strengths and weaknesses, allowing you to make informed decisions about how to improve it.