While Graph Convolutional Networks (GCNs) provide a powerful framework for learning on graphs, their aggregation mechanism often treats all neighbors with uniform importance (adjusted only by node degrees). Consider a social network graph where you want to predict a user's interests. Some friends might be much more influential or share more similar interests than others. A standard GCN might struggle to capture these nuances effectively because its convolution kernel weights are fixed based on the graph structure (specifically, the normalized adjacency or Laplacian matrix).
Graph Attention Networks (GATs) introduce a way to overcome this limitation by incorporating self-attention mechanisms, inspired by their success in sequence-based tasks. The central idea is to allow nodes to assign different levels of importance, or attention, to different nodes within their neighborhood during the aggregation process. These attention weights are not fixed; they are computed dynamically based on the features of the interacting nodes.
Instead of using predefined weights such as $1/\sqrt{\deg(i)\deg(j)}$ as in a GCN, a GAT computes attention coefficients, denoted $\alpha_{ij}$, for each edge $(j, i)$, indicating the importance of node $j$'s features to node $i$. This process typically involves four steps within a GAT layer:
Linear Transformation: First, the input features $h_i \in \mathbb{R}^F$ of each node $i$ are transformed by a learnable weight matrix $W \in \mathbb{R}^{F' \times F}$, producing higher-level features $z_i = W h_i \in \mathbb{R}^{F'}$. This initial transformation increases the model's capacity to learn relevant representations.
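A minimal sketch of this step in PyTorch (one common choice of framework, not mandated by the text); the dimensions and random features below are purely illustrative:

```python
import torch
import torch.nn as nn

# Illustrative sizes: F = 8 input features, F' = 4 transformed features, 5 nodes.
F_in, F_out, num_nodes = 8, 4, 5

h = torch.randn(num_nodes, F_in)         # node feature matrix, one row per node
W = nn.Linear(F_in, F_out, bias=False)   # learnable weight matrix W

z = W(h)                                 # z_i = W h_i for every node, shape (5, 4)
```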
Attention Coefficient Calculation: For each edge connecting node $j$ to node $i$, an attention score $e_{ij}$ is computed. This score signifies the unnormalized importance of node $j$'s features to node $i$. A common way to compute $e_{ij}$ is with a shared attention mechanism, typically a single-layer feedforward neural network parameterized by a learnable weight vector $\mathbf{a} \in \mathbb{R}^{2F'}$. Its input is the concatenation of the transformed features of the two nodes, $z_i$ and $z_j$:
$$e_{ij} = \text{LeakyReLU}\left(\mathbf{a}^T \left[ W h_i \,\|\, W h_j \right]\right)$$
Here, $\|$ denotes concatenation, and LeakyReLU is commonly used as the non-linearity. This mechanism is shared across all edges, meaning the same $\mathbf{a}$ and $W$ are used to compute attention scores for every node pair.
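As a sketch, the scores for all node pairs can be computed as below; the tensor `z` stands in for the transformed features from the previous step, and the attention vector `a` is randomly initialized purely for illustration:

```python
import torch
import torch.nn.functional as F

num_nodes, F_out = 5, 4
z = torch.randn(num_nodes, F_out)     # stand-in for z_i = W h_i from the previous step
a = torch.randn(2 * F_out)            # learnable attention vector a (random here)

# Build the concatenation [z_i || z_j] for every ordered pair (i, j): shape (N, N, 2F').
z_i = z.unsqueeze(1).expand(-1, num_nodes, -1)
z_j = z.unsqueeze(0).expand(num_nodes, -1, -1)
pairs = torch.cat([z_i, z_j], dim=-1)

# e_ij = LeakyReLU(a^T [z_i || z_j]); 0.2 is the negative slope used in the original GAT paper.
e = F.leaky_relu(pairs @ a, negative_slope=0.2)   # shape (N, N)
```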
Normalization via Softmax: The raw attention scores $e_{ij}$ are not directly comparable across different nodes. To make them comparable and obtain normalized attention coefficients $\alpha_{ij}$, a softmax function is applied across all neighbors $j$ of node $i$ (usually including node $i$ itself via a self-loop):
$$\alpha_{ij} = \text{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i \cup \{i\}} \exp(e_{ik})}$$
where $\mathcal{N}_i$ denotes the neighborhood of node $i$. The normalized coefficients $\alpha_{ij}$ now sum to 1 over all neighbors $j$ of each node $i$, forming a weighted distribution of attention.
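A small sketch of this normalization: scores of non-edges are masked with $-\infty$ before the softmax so they receive zero weight, leaving each row of $\alpha$ summing to 1 over the neighborhood. The adjacency matrix and raw scores below are made up for illustration:

```python
import torch

N = 5
e = torch.randn(N, N)                 # raw scores e_ij from the previous step (random here)
adj = torch.tensor([                  # toy adjacency; the diagonal provides the self-loops
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 0, 1, 1],
], dtype=torch.float)

# Set scores of non-neighbors to -inf so softmax assigns them zero weight,
# then normalize each row over N_i ∪ {i}.
alpha = torch.softmax(e.masked_fill(adj == 0, float('-inf')), dim=1)
print(alpha.sum(dim=1))               # each row sums to 1
```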
Weighted Aggregation: Finally, the output features $h_i' \in \mathbb{R}^{F'}$ for node $i$ are computed as a weighted sum of the transformed features of its neighbors, using the normalized attention coefficients $\alpha_{ij}$ as the weights. A non-linear activation function $\sigma$ (such as ReLU or ELU) is typically applied afterward:
$$h_i' = \sigma\left(\sum_{j \in \mathcal{N}_i \cup \{i\}} \alpha_{ij} W h_j\right)$$
This process effectively allows each node to selectively focus on its most relevant neighbors when updating its representation.
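Putting the four steps together, here is a compact single-head GAT layer sketched in PyTorch. It uses a dense adjacency matrix and simple initialization for brevity; names such as `GATLayer` and `adj` are illustrative rather than part of any particular library:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Minimal single-head GAT layer over a dense adjacency matrix (for illustration)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.W = nn.Linear(in_features, out_features, bias=False)   # step 1: weight matrix W
        self.a = nn.Parameter(torch.randn(2 * out_features) * 0.1)  # step 2: attention vector a

    def forward(self, h, adj):
        # h: (N, F) node features; adj: (N, N) adjacency with self-loops already added.
        z = self.W(h)                                                # z_i = W h_i
        N = z.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(-1, N, -1),         # [z_i || z_j] for all pairs
                           z.unsqueeze(0).expand(N, -1, -1)], dim=-1)
        e = F.leaky_relu(pairs @ self.a, negative_slope=0.2)         # raw scores e_ij
        e = e.masked_fill(adj == 0, float('-inf'))                   # keep only real edges
        alpha = torch.softmax(e, dim=1)                              # step 3: normalized alpha_ij
        return F.elu(alpha @ z)                                      # step 4: weighted sum + ELU

# Usage on a tiny 5-node graph.
layer = GATLayer(in_features=8, out_features=4)
h = torch.randn(5, 8)
adj = torch.eye(5)                       # self-loops
adj[0, 1] = adj[1, 0] = 1.0              # a couple of edges
adj[1, 2] = adj[2, 1] = 1.0
out = layer(h, adj)                      # shape (5, 4)
```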
A view of the GAT aggregation process for node $i$. Attention coefficients $\alpha_{ij}$ determine the weight of messages passed from neighbors (including the node itself) to update the node's feature vector $h_i'$.
Graph Attention Networks offer several benefits compared to simpler convolutional approaches:
- Adaptive weighting: neighbor contributions are learned from node features rather than fixed by the graph structure, so the model can emphasize the most informative neighbors.
- Inductive capability: because attention depends only on node features and local connectivity, a trained GAT can be applied to nodes or graphs unseen during training.
- Interpretability: the learned coefficients $\alpha_{ij}$ can be inspected to see which neighbors influenced a prediction.
- Efficiency: attention scores are computed per edge and in parallel, without costly global operations such as eigendecompositions of the graph Laplacian.
While a single GAT layer computes one set of attention weights, this mechanism can be extended. A common practice is to use multi-head attention, where multiple independent attention mechanisms are computed in parallel, and their results are combined. This technique, discussed in the next section, often leads to more stable training and allows the model to capture different types of relationships within the neighborhood simultaneously.