Building upon the foundational concepts reviewed in the previous chapter, we now examine specific advanced Graph Neural Network architectures. While many innovative designs exist, the Graph Convolutional Network (GCN) remains a bedrock architecture and a frequent starting point for understanding more complex models. Although GCNs appear informally in many introductory treatments, here we analyze their spectral roots more formally and connect them to the widely used spatial implementation based on message passing.
From Spectral Filters to Spatial Aggregation
Recall from Chapter 1 that spectral graph convolutions operate in the Fourier domain, defined by the eigenvectors of the graph Laplacian $L$. A spectral filter $g_\theta$ applied to a graph signal $x$ is defined as:
$$g_\theta * x = U\, g_\theta(\Lambda)\, U^T x$$
Here, $U$ contains the eigenvectors of $L = D - A$, $\Lambda$ is the diagonal matrix of eigenvalues, and $g_\theta(\Lambda)$ applies the filter function element-wise to the eigenvalues. While theoretically elegant, this formulation suffers from two main drawbacks:
Computational Cost: Calculating the eigen-decomposition of $L$ for large graphs is computationally expensive, often $O(N^3)$ where $N$ is the number of nodes.
Non-Localization: The eigenvectors $U$ are generally dense, meaning the filter is not localized in the vertex domain; changing a single node's feature can influence the output across the entire graph.
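To make these drawbacks concrete, the sketch below implements the spectral filter directly with a full eigendecomposition. This is a minimal NumPy illustration, not how practical GNNs are implemented; the function name `spectral_filter` and the example low-pass filter are purely illustrative. The `np.linalg.eigh` call is the $O(N^3)$ bottleneck mentioned above, and the dense $U$ is what destroys locality.

```python
import numpy as np

def spectral_filter(A, x, filter_fn):
    """Apply a spectral filter g_theta to a graph signal x, given adjacency A.

    A         : (N, N) symmetric adjacency matrix
    x         : (N,) graph signal (one scalar feature per node)
    filter_fn : callable mapping eigenvalues to filter responses, i.e. g_theta
    """
    D = np.diag(A.sum(axis=1))       # degree matrix
    L = D - A                        # graph Laplacian L = D - A
    lam, U = np.linalg.eigh(L)       # O(N^3) eigendecomposition: L = U diag(lam) U^T
    x_hat = U.T @ x                  # graph Fourier transform
    x_filt = filter_fn(lam) * x_hat  # apply g_theta(Lambda) element-wise
    return U @ x_filt                # inverse transform back to the vertex domain

# Example: an exponential low-pass filter on a 3-node path graph.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
x = np.array([1.0, 0.0, -1.0])
y = spectral_filter(A, x, filter_fn=lambda lam: np.exp(-lam))
```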
To address these issues, approximations using polynomials in the Laplacian were proposed. ChebNet, for instance, uses Chebyshev polynomials $T_k$ to approximate the filter:
$$g_\theta(\Lambda) \approx \sum_{k=0}^{K} \theta_k\, T_k(\tilde{\Lambda})$$
where $\tilde{\Lambda}$ is a rescaled version of $\Lambda$ whose entries lie within $[-1, 1]$, the interval on which the Chebyshev polynomials form an orthogonal basis. This polynomial approximation makes the filter $K$-localized, meaning it only depends on the $K$-hop neighborhood of a node, and it avoids the explicit computation of eigenvectors.
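A minimal sketch of the resulting computation follows, assuming the coefficients `theta` are given (in ChebNet they are learned) and that an estimate of $\lambda_{\max}$ is available. Each Chebyshev term comes from the recurrence $T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x)$, so only repeated (ideally sparse) multiplications by the rescaled Laplacian are needed and no eigendecomposition is performed.

```python
import numpy as np

def chebyshev_filter(L, x, theta, lam_max=2.0):
    """Approximate g_theta(L) x with a Chebyshev expansion (assumes len(theta) >= 2).

    L       : (N, N) graph Laplacian
    x       : (N,) graph signal
    theta   : list of K+1 filter coefficients (learned in ChebNet; fixed here)
    lam_max : (estimated) largest eigenvalue of L, used to rescale its spectrum into [-1, 1]
    """
    N = L.shape[0]
    L_tilde = 2.0 * L / lam_max - np.eye(N)        # rescaled Laplacian with spectrum in [-1, 1]
    Tx_prev, Tx = x, L_tilde @ x                   # T_0(L~) x = x,  T_1(L~) x = L~ x
    out = theta[0] * Tx_prev + theta[1] * Tx
    for k in range(2, len(theta)):
        Tx_next = 2.0 * (L_tilde @ Tx) - Tx_prev   # recurrence: T_k = 2 L~ T_{k-1} - T_{k-2}
        out = out + theta[k] * Tx_next
        Tx_prev, Tx = Tx, Tx_next
    return out
```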
The GCN Simplification
The Graph Convolutional Network, as introduced by Kipf and Welling (2017), can be understood as a specific, highly effective simplification of this spectral approach. It essentially makes two key approximations:
Linear Filter: It restricts the polynomial filter to $K = 1$. This implies the filter considers only the immediate (1-hop) neighborhood. The filter function becomes linear with respect to the Laplacian's eigenvalues: $g_\theta(\Lambda) \approx \theta_0 I + \theta_1 \Lambda$.
Parameter Reduction: It further simplifies by setting $\theta_0 = -\theta_1 = \theta$, reducing the filter to $g_\theta(\Lambda) \approx \theta (I - \Lambda)$. However, practical GCNs use a slightly different formulation based on a renormalized adjacency matrix.
The crucial step in the practical GCN formulation involves a renormalization trick. Instead of directly using the Laplacian $L = I - D^{-1/2} A D^{-1/2}$ (the normalized Laplacian), GCN employs a modified adjacency matrix $\hat{A}$:
$$\hat{A} = \tilde{D}^{-1/2}\, \tilde{A}\, \tilde{D}^{-1/2}$$
where $\tilde{A} = A + I$ is the adjacency matrix with self-loops added, and $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$ (i.e., $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$).
Why this specific form?
Adding Self-Loops ($A + I$): Ensures that a node's own features are included in the aggregation process during message passing. Without this, a node's updated representation would depend solely on its neighbors.
Symmetric Normalization ($\tilde{D}^{-1/2} \cdots \tilde{D}^{-1/2}$): This normalization prevents the scale of node feature vectors from changing drastically with node degree during aggregation. Multiplying by $A$ alone would sum neighbor features, causing high-degree nodes to have much larger feature values. Row normalization (as in $\tilde{D}^{-1}\tilde{A}$) averages neighbor features instead, but the resulting operator is no longer symmetric and accounts only for the degree of the receiving node. The symmetric normalization $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ balances these effects and has favorable spectral properties: whereas the operator $I + D^{-1/2} A D^{-1/2}$ arising from the first-order filter has eigenvalues in $[0, 2]$, and its repeated application can cause numerical instabilities and exploding or vanishing gradients, the renormalized $\hat{A}$ has its spectrum bounded within $(-1, 1]$, which helps stabilize training, particularly in deeper networks.
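The renormalization trick itself is only a few lines of code. Below is a minimal NumPy sketch (dense matrices for readability; the helper name `normalized_adjacency` is illustrative, and real implementations use sparse operations):

```python
import numpy as np

def normalized_adjacency(A):
    """Compute A_hat = D~^{-1/2} (A + I) D~^{-1/2} for a symmetric adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])              # add self-loops: A~ = A + I
    d_tilde = A_tilde.sum(axis=1)                 # degrees of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))  # D~^{-1/2}
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt      # symmetric normalization

# Example on a small path graph 0 - 1 - 2: entry (i, j) of A_hat is 1 / sqrt(d~_i d~_j)
# whenever i and j are connected (or i == j), and 0 otherwise.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
A_hat = normalized_adjacency(A)
```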
The GCN Layer Propagation Rule
Combining these elements, the layer-wise propagation rule for a GCN takes a remarkably simple form. Given node features $H^{(l)} \in \mathbb{R}^{N \times F_l}$ at layer $l$ (where $N$ is the number of nodes and $F_l$ is the feature dimension), the features $H^{(l+1)} \in \mathbb{R}^{N \times F_{l+1}}$ at the next layer are computed as:
$$H^{(l+1)} = \sigma\!\left(\hat{A}\, H^{(l)} W^{(l)}\right)$$
Let's break this down:
$H^{(l)}$: The matrix of node embeddings from the previous layer (or the initial node features for $l = 0$).
$W^{(l)} \in \mathbb{R}^{F_l \times F_{l+1}}$: A layer-specific trainable weight matrix. This matrix transforms the features of each node.
$\hat{A} \in \mathbb{R}^{N \times N}$: The pre-computed, normalized adjacency matrix with self-loops. Multiplication by $\hat{A}$ performs the core neighborhood aggregation. Specifically, $(\hat{A}X)_i = \sum_{j \in \mathcal{N}(i) \cup \{i\}} \frac{1}{\sqrt{\tilde{d}_i \tilde{d}_j}} X_j$, where $X = H^{(l)} W^{(l)}$ and $\tilde{d}_i$ is the degree of node $i$ in $\tilde{A}$.
$\sigma(\cdot)$: An element-wise non-linear activation function, such as ReLU or ELU.
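In code, one GCN layer therefore reduces to two matrix multiplications followed by an activation. The PyTorch sketch below is a minimal dense implementation for illustration (the class name `GCNLayer` is ours; library implementations such as PyTorch Geometric's `GCNConv` use sparse message passing and additional options). It assumes $\hat{A}$ has been precomputed as above.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer: H_next = sigma(A_hat @ H @ W)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=False)  # W^(l)

    def forward(self, A_hat, H):
        # A_hat : (N, N) precomputed normalized adjacency with self-loops
        # H     : (N, F_l) node features from the previous layer
        H = self.linear(H)      # per-node feature transformation: H W^(l)
        H = A_hat @ H           # neighborhood aggregation: A_hat (H W^(l))
        return torch.relu(H)    # element-wise non-linearity
```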
Multiple GCN layers can be stacked to allow information to propagate across larger distances in the graph. A typical two-layer GCN for semi-supervised node classification might look like:
$$Z = \mathrm{softmax}\!\left(\hat{A}\, \mathrm{ReLU}\!\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$$
where $X$ is the matrix of input features, $W^{(0)}$ and $W^{(1)}$ are the weight matrices, and $Z$ contains the output class probabilities for each node.
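A minimal sketch of this two-layer model, again in PyTorch with dense matrices (the hidden size and class count are placeholders; in practice one typically uses a log-softmax with a negative log-likelihood loss computed only on the labelled nodes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    """Z = softmax(A_hat ReLU(A_hat X W0) W1) for semi-supervised node classification."""

    def __init__(self, in_features, hidden_features, num_classes):
        super().__init__()
        self.W0 = nn.Linear(in_features, hidden_features, bias=False)
        self.W1 = nn.Linear(hidden_features, num_classes, bias=False)

    def forward(self, A_hat, X):
        H = torch.relu(A_hat @ self.W0(X))  # first GCN layer
        Z = A_hat @ self.W1(H)              # second GCN layer (class logits)
        return F.log_softmax(Z, dim=1)      # per-node (log-)class probabilities

# Training: compute an NLL / cross-entropy loss on the labelled subset of nodes only.
```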
GCN as Message Passing
The GCN propagation rule fits neatly into the message passing framework discussed in Chapter 1. For a single node $i$, the update can be seen as:
Transformation: Each node $j$ (including $i$ itself, due to the self-loop in $\tilde{A}$) transforms its feature vector $h_j^{(l)}$ using the weight matrix: $m_j^{(l)} = h_j^{(l)} W^{(l)}$.
Aggregation: Node $i$ aggregates the transformed messages from its neighbors (and itself), weighted by the normalization constants from $\hat{A}$:
$$a_i^{(l)} = \sum_{j \in \mathcal{N}(i) \cup \{i\}} \frac{1}{\sqrt{\tilde{d}_i \tilde{d}_j}}\, m_j^{(l)}$$
This aggregation is implicitly performed by the matrix multiplication $\hat{A}\left(H^{(l)} W^{(l)}\right)$.
Update: Apply the non-linear activation function:
$$h_i^{(l+1)} = \sigma\!\left(a_i^{(l)}\right)$$
This perspective highlights that GCN performs a specific, fixed form of neighborhood aggregation determined by the graph structure ($\hat{A}$), followed by a shared linear transformation ($W^{(l)}$) and non-linearity.
Figure: Message passing view of a GCN layer update for node $i$. Features from neighbors and the node itself ($h_j$, $h_i$) are transformed by $W^{(l)}$, aggregated using weights derived from the normalized adjacency matrix $\hat{A}$, and then passed through a non-linearity $\sigma$.
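The same update can be written as an explicit per-node loop, which makes the three message-passing steps visible. This is a purely illustrative NumPy sketch; the matrix form $\hat{A} H^{(l)} W^{(l)}$ above is what one would actually compute.

```python
import numpy as np

def gcn_node_update(i, A_tilde, H, W, sigma=lambda z: np.maximum(z, 0.0)):
    """Message-passing view of one GCN update for node i.

    A_tilde : (N, N) adjacency with self-loops, A~ = A + I
    H       : (N, F_l) node features at layer l
    W       : (F_l, F_next) weight matrix W^(l)
    sigma   : non-linearity (ReLU by default)
    """
    d = A_tilde.sum(axis=1)                # degrees d~ in A~
    a_i = np.zeros(W.shape[1])
    for j in np.nonzero(A_tilde[i])[0]:    # N(i) plus i itself, via the self-loop
        m_j = H[j] @ W                     # transformation: m_j = h_j W^(l)
        a_i += m_j / np.sqrt(d[i] * d[j])  # aggregation, weighted by 1 / sqrt(d~_i d~_j)
    return sigma(a_i)                      # update: h_i^(l+1) = sigma(a_i)
```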
Strengths and Limitations
GCNs gained popularity due to their simplicity, efficiency (compared to earlier spectral methods), and strong performance on benchmark tasks like semi-supervised node classification on citation networks (Cora, Citeseer, Pubmed). They provide an effective baseline model.
However, GCNs have limitations:
Limited Expressivity: The fixed aggregation scheme based on $\hat{A}$ treats all neighbors equally (after normalization) and limits the model's ability to distinguish certain graph structures, as discussed in relation to the WL test in Chapter 1.
Oversmoothing: Stacking many GCN layers tends to make node representations converge towards a common value, losing discriminative power. The repeated averaging across neighborhoods smooths out node features (see the short sketch after this list).
Homophily Assumption: GCNs implicitly work best under the assumption of homophily, where connected nodes tend to have similar features or labels. The averaging mechanism reinforces this. Performance can degrade on graphs with significant heterophily (where connected nodes are often dissimilar).
Limited Handling of Edge Information: The standard GCN formulation uses a fixed, typically unweighted adjacency matrix $A$. While scalar edge weights can be folded into $A$ and other modifications exist, the base model doesn't inherently learn aggregation weights from edge features or incorporate richer edge attributes.
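Oversmoothing, in particular, is easy to observe numerically: even without learned weights or non-linearities, repeatedly applying $\hat{A}$ drives the node features toward a common direction. A toy NumPy sketch (the graph and feature sizes are arbitrary placeholders):

```python
import numpy as np

# A small connected graph and its renormalized adjacency A_hat.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(4)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt

H = np.random.randn(4, 3)   # random initial node features
for _ in range(50):         # repeated aggregation, ignoring weights and activations
    H = A_hat @ H

# After many propagation steps the rows of H are nearly proportional to one another,
# so normalizing each row makes the nodes almost indistinguishable.
H_unit = H / np.linalg.norm(H, axis=1, keepdims=True)
print(np.round(H_unit, 3))
```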
These limitations motivate the development of more sophisticated architectures. For instance, Graph Attention Networks (GATs), discussed next, introduce learnable attention mechanisms to assign different importance weights to neighbors during aggregation, directly addressing the fixed weighting scheme of GCNs. Other architectures tackle oversmoothing or adapt GNNs for different graph types and tasks, as we will explore throughout this course. Understanding the spectral origins and practical message-passing implementation of GCN provides a vital foundation for appreciating these subsequent advancements.