A single Graph Neural Network (GNN) layer updates each node's features using information from that node's immediate neighbors. While this layer is a powerful building block, its perspective is limited to a 1-hop neighborhood. To capture information from more distant parts of the graph, GNN layers are stacked, forming a deep Graph Neural Network. This layering mirrors how standard deep neural networks build hierarchies of features.
The primary reason for stacking GNN layers is to expand each node's receptive field. A node's receptive field is the set of nodes in the graph that can influence its final representation.
With each additional GNN layer, the receptive field of every node expands by one hop. A GNN with K layers can therefore propagate information between nodes that are up to K hops apart. This allows the model to learn features based on larger sub-structures within the graph.
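To make this concrete, here is a minimal sketch that computes a node's K-hop receptive field with a breadth-first search. The function name `receptive_field` and the small chain graph are hypothetical choices for illustration, not part of any GNN library.

```python
from collections import deque

def receptive_field(adj, start, k):
    """Return the set of nodes within k hops of `start` (its k-hop receptive field)."""
    visited = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for neighbor in adj[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return visited

# Hypothetical chain graph: 0 - 1 - 2 - 3 - 4
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(receptive_field(adj, 0, 1))  # {0, 1}: one layer sees only 1-hop neighbors
print(receptive_field(adj, 0, 2))  # {0, 1, 2}: two layers reach nodes 2 hops away
```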
The diagram below illustrates this process. To compute the final representation for Node A after two layers, the model first aggregates information from its 1-hop neighbors (B and C) in the first layer. In the second layer, it aggregates the updated representations of B and C. Because the representations of B and C already contain information from their neighbors (D and E, respectively), Node A's final representation is influenced by its 2-hop neighbors.
Figure: The flow of information to Node A over two GNN layers. After the second layer, Node A's representation incorporates information from nodes D and E, which are two hops away.
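This two-layer flow can also be checked numerically. The sketch below encodes the figure's graph (edges A-B, A-C, B-D, C-E) and uses a simple mean over each node and its neighbors as a stand-in for one GNN layer; mean aggregation here is an illustrative assumption rather than a specific architecture. Tracking one-hot features shows that Node A only picks up signal from D and E after the second layer.

```python
import numpy as np

# Graph from the figure: A-B, A-C, B-D, C-E (undirected).
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "E")]
idx = {n: i for i, n in enumerate(nodes)}

# Adjacency matrix with self-loops, so each node keeps part of its own signal.
A_hat = np.eye(len(nodes))
for u, v in edges:
    A_hat[idx[u], idx[v]] = A_hat[idx[v], idx[u]] = 1.0

# Row-normalize: each "layer" replaces a node's features with the mean over itself and its neighbors.
P = A_hat / A_hat.sum(axis=1, keepdims=True)

# One-hot features: column j of H tracks how much of node j's original signal each node holds.
H = np.eye(len(nodes))
H1 = P @ H        # after layer 1: A mixes in B and C only
H2 = P @ H1       # after layer 2: A also mixes in D and E

print(np.round(H1[idx["A"]], 2))  # nonzero only in the A, B, C positions
print(np.round(H2[idx["A"]], 2))  # now also nonzero in the D and E positions
```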
Formally, a multi-layer GNN works by composing several message passing layers. The output embeddings of layer $l$, denoted as $H^{(l)}$, become the input embeddings for layer $l+1$. The process starts with the initial node features, $H^{(0)} = X$.
For a node $v$ with neighborhood $\mathcal{N}(v)$, the computation for the first two layers proceeds as follows:

$$h_v^{(1)} = \text{UPDATE}^{(1)}\left(h_v^{(0)},\ \text{AGGREGATE}^{(1)}\left(\{h_u^{(0)} : u \in \mathcal{N}(v)\}\right)\right)$$

$$h_v^{(2)} = \text{UPDATE}^{(2)}\left(h_v^{(1)},\ \text{AGGREGATE}^{(2)}\left(\{h_u^{(1)} : u \in \mathcal{N}(v)\}\right)\right)$$

where $h_v^{(0)} = x_v$ is the initial feature vector of node $v$.
This process is repeated for $K$ layers. The AGGREGATE and UPDATE functions for each layer typically have their own set of trainable parameters (e.g., weight matrices), allowing the network to learn different feature transformations at different depths. The final output of the $K$-th layer, $h_v^{(K)}$, is the node embedding used for the downstream task.
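As a minimal sketch of this composition, the code below assumes mean aggregation and a per-layer linear transform followed by a ReLU as the UPDATE. The helper `gnn_forward`, the toy graph, and the layer sizes are hypothetical choices for illustration; real architectures define AGGREGATE and UPDATE differently.

```python
import numpy as np

rng = np.random.default_rng(0)

def gnn_forward(A, X, weights):
    """Run a K-layer GNN sketch: mean aggregation plus a per-layer linear UPDATE and ReLU.

    A: (n, n) adjacency matrix, X: (n, d) initial features H^(0),
    weights: list of K weight matrices, one per layer (the trainable parameters).
    """
    A_hat = A + np.eye(A.shape[0])                    # add self-loops
    P = A_hat / A_hat.sum(axis=1, keepdims=True)      # mean AGGREGATE over self + neighbors
    H = X                                             # H^(0) = X
    for W in weights:                                 # layer l: H^(l) = ReLU(P H^(l-1) W^(l))
        H = np.maximum(P @ H @ W, 0.0)                # UPDATE: linear transform + nonlinearity
    return H                                          # H^(K): embeddings for the downstream task

# Tiny example: 4-node cycle graph, 8-dim features, K = 2 layers with their own weights.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 8))
weights = [rng.normal(size=(8, 16)), rng.normal(size=(16, 4))]
print(gnn_forward(A, X, weights).shape)  # (4, 4)
```

Note that `weights` holds one matrix per layer, reflecting the point above that each depth has its own trainable parameters.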
While depth allows GNNs to access a wider graph context, there is a significant drawback to making them too deep: over-smoothing. Over-smoothing is a phenomenon where, after many message passing iterations, the representations of all nodes in a connected graph converge to a similar value.
Think of it like dropping a bit of colored dye into a pool of water. After one stir (one GNN layer), the color spreads to its immediate vicinity. After many stirs, the dye diffuses evenly throughout the entire pool, making it impossible to tell where the dye originated. Similarly, as node features are repeatedly averaged with their neighbors, they lose their initial, distinguishing information.
When node embeddings become indistinguishable, the model's performance on tasks like node classification degrades significantly, as it can no longer tell the nodes apart. Because of over-smoothing, most GNN architectures used in practice are relatively shallow, often consisting of only 2 to 4 layers. Mitigating this issue is an active area of GNN research, leading to more complex architectures with skip connections or other mechanisms to preserve initial node information.
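Over-smoothing is easy to observe numerically. Assuming a bare mean-aggregation step with no learned UPDATE, the sketch below applies it repeatedly to random features on a small connected graph and measures how far each node's embedding sits from the average embedding; the graph construction and layer counts are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demo: apply mean aggregation repeatedly and watch node embeddings converge.
n = 20
A = np.zeros((n, n))
for i in range(n):                                    # ring, so the graph is connected
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
chords = np.triu(rng.random((n, n)) < 0.2, k=1).astype(float)
A = np.maximum(A, chords + chords.T)                  # add random chords for a denser graph

A_hat = A + np.eye(n)
P = A_hat / A_hat.sum(axis=1, keepdims=True)          # mean over self + neighbors

H = rng.normal(size=(n, 8))                           # random initial node features
for layer in range(1, 31):
    H = P @ H                                         # one smoothing step (no learned UPDATE)
    spread = np.linalg.norm(H - H.mean(axis=0), axis=1).mean()
    if layer in (1, 2, 4, 8, 16, 30):
        print(f"layer {layer:2d}: avg. distance from the mean embedding = {spread:.4f}")
# The spread shrinks with depth: after many layers the rows of H are nearly identical,
# which is the over-smoothing effect described above.
```

As a rough intuition for the skip-connection style remedies mentioned above, mixing the initial features back into each step (for example `H = 0.9 * (P @ H) + 0.1 * H0`, where `H0` holds the original features) keeps the embeddings from collapsing to a single point.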
Understanding this layered structure is fundamental. In the next chapter, we will examine specific, influential architectures like GCN and GraphSAGE that define concrete forms for the AGGREGATE and UPDATE functions within this multi-layer framework.