Hands-on: Loading and Inspecting a Graph Dataset

Theory provides the blueprint, but a solid understanding comes from working with data directly. In this section we load a classic graph dataset with the Python library NetworkX and examine its properties, turning abstract representations such as the adjacency matrix and the node feature matrix into concrete arrays. This connects the mathematical definitions with their practical implementation.

Setting Up Your Environment

Before we begin, install NetworkX, matplotlib for visualization, and NumPy for the matrix representations used later in this section. NetworkX is a library for creating, manipulating, and studying the structure and dynamics of complex networks. You can install these packages using pip:

pip install networkx matplotlib numpy

With the environment ready, we can proceed to load and inspect our first graph.

Zachary's Karate Club: A Classic Graph Dataset

We will use a well-known social network dataset called "Zachary's Karate Club". This graph represents the social relationships between 34 members of a university karate club in the 1970s. A conflict between the club's administrator and its instructor led the club to split into two factions. An edge connects two members who were friends outside the club, and a common task is to predict which faction each member joined after the split.

Fortunately, this dataset is included with NetworkX, making it very easy to load.

import networkx as nx
import matplotlib.pyplot as plt

# Load the Zachary's Karate Club graph
G = nx.karate_club_graph()

# Print some basic information about the graph
print(f"Number of nodes: {G.number_of_nodes()}")
print(f"Number of edges: {G.number_of_edges()}")

Executing this code will produce the following output:

Number of nodes: 34
Number of edges: 78

This tells us our graph has 34 nodes (club members) and 78 edges (friendships).
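Two summary numbers follow directly from these counts. The sketch below computes the average degree by hand and the density with `nx.density`; the variable names are chosen only for this illustration.

```python
import networkx as nx

G = nx.karate_club_graph()

n = G.number_of_nodes()
m = G.number_of_edges()

# Each undirected edge adds one to the degree of two nodes,
# so the average degree is 2m / n.
avg_degree = 2 * m / n

# Density: the fraction of the n(n-1)/2 possible undirected edges that exist.
density = nx.density(G)

print(f"Average degree: {avg_degree:.2f}")
print(f"Density: {density:.3f}")
```

With 34 nodes and 78 edges, the average degree is 2 × 78 / 34 ≈ 4.59 and the density is about 0.139, so the graph is small but fairly sparse.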

Exploring Nodes and Their Attributes

In NetworkX, nodes are more than just numbers. They can hold attributes, or features, that contain information about them. In the Karate Club graph, each node has a 'club' attribute indicating which faction the member joined ("Mr. Hi" or "Officer").

Let's inspect node 0 (the instructor) and node 33 (the administrator).

# Access attributes of a specific node
node_0_data = G.nodes[0]
print(f"Node 0 data: {node_0_data}")

# The club attribute represents the faction
print(f"Node 0 joined the '{node_0_data['club']}' faction.")
print(f"Node 33 joined the '{G.nodes[33]['club']}' faction.")

Output:

Node 0 data: {'club': 'Mr. Hi'}
Node 0 joined the 'Mr. Hi' faction.
Node 33 joined the 'Officer' faction.

This 'club' attribute is the ground-truth label we would try to predict in a node classification task. A GNN would use the graph's structure and potentially other node features to make these predictions.
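To use the 'club' attribute as a training target, it is usually converted into an integer label vector. The following is a minimal sketch of one way to do that; the name `label_map` and the choice of 0 for "Mr. Hi" and 1 for "Officer" are arbitrary assumptions made for this example.

```python
from collections import Counter

import networkx as nx
import numpy as np

G = nx.karate_club_graph()

# Count how many members ended up in each faction.
faction_counts = Counter(nx.get_node_attributes(G, "club").values())
print(faction_counts)

# Encode the string labels as integers (the mapping is an arbitrary choice).
label_map = {"Mr. Hi": 0, "Officer": 1}
y = np.array([label_map[G.nodes[n]["club"]] for n in G.nodes()])

print("Label vector shape:", y.shape)
print("Labels of nodes 0 and 33:", y[0], y[33])
```

The vector `y` has one entry per node, in the same order as `G.nodes()`, which keeps it aligned with any matrices built from the graph later.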

Visualizing the Graph Structure

A great way to build intuition for a graph is to visualize it. We can use matplotlib along with NetworkX's drawing capabilities to plot the graph.

[Figure: A small portion of the Karate Club graph, showing connections between members of the two factions led by node 0 ("Mr. Hi") and node 33 ("Officer").]

Now, let's plot the entire graph. We can make our visualization more informative by coloring the nodes according to their club affiliation. This will give us a clear visual of the two communities.

# Create a color map based on the 'club' attribute
colors = []
for node in G:
    if G.nodes[node]['club'] == 'Mr. Hi':
        colors.append('#9775fa')  # Violet
    else:
        colors.append('#ffa94d')  # Orange

# Draw the graph
plt.figure(figsize=(8, 6))
nx.draw_spring(G, with_labels=True, node_color=colors, node_size=500)
plt.title("Zachary's Karate Club Social Network")
plt.show()

This script generates a plot in which the two factions are clearly visible. Because the spring layout starts from random positions, the exact arrangement changes from run to run. Friendships (edges) cluster within each group, with fewer connections between them. This community structure is the signal a GNN can exploit.
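The clustering seen in the plot can also be checked numerically. This sketch counts how many edges stay inside a faction and how many cross between the two; the counter names are illustrative.

```python
import networkx as nx

G = nx.karate_club_graph()

intra_edges = 0  # both endpoints belong to the same faction
inter_edges = 0  # endpoints belong to different factions

for u, v in G.edges():
    if G.nodes[u]["club"] == G.nodes[v]["club"]:
        intra_edges += 1
    else:
        inter_edges += 1

print(f"Edges within a faction: {intra_edges}")
print(f"Edges between factions: {inter_edges}")
```

The two counts add up to the total number of edges, and the within-faction count is much larger, which matches the visual impression.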

From Graph Object to Matrix Representations

As we discussed previously, GNNs don't operate on graph objects directly. They require numerical tensors: an adjacency matrix $A$ and a node feature matrix $X$. NetworkX makes it simple to derive the adjacency matrix.

import numpy as np

# Get the adjacency matrix as a NumPy array.
# weight=None ignores any edge 'weight' attribute (some NetworkX releases
# store one on this graph), so every edge is recorded as a 1.
A = nx.to_numpy_array(G, weight=None)

print("Adjacency Matrix Shape:", A.shape)
print("Top-left 5x5 block of A:")
print(A[:5, :5])

Output:

Adjacency Matrix Shape: (34, 34)
Top-left 5x5 block of A:
[[0. 1. 1. 1. 1.]
 [1. 0. 1. 1. 0.]
 [1. 1. 0. 1. 0.]
 [1. 1. 1. 0. 0.]
 [1. 0. 0. 0. 0.]]

The shape of the adjacency matrix is $(34, 34)$, or $N \times N$, where $N$ is the number of nodes. A 1 at $A[i, j]$ indicates an edge between node $i$ and node $j$, while a 0 indicates no direct connection. Because the graph is undirected, $A$ is symmetric.
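A few quick sanity checks confirm that the matrix encodes the same graph. The sketch below passes `weight=None` on the assumption that the installed NetworkX version may attach edge weights to this dataset; with that argument every edge is a 1 regardless of version.

```python
import networkx as nx
import numpy as np

G = nx.karate_club_graph()
A = nx.to_numpy_array(G, weight=None)  # binary adjacency matrix

# An undirected graph has a symmetric adjacency matrix.
is_symmetric = np.array_equal(A, A.T)

# No member is connected to themselves, so the diagonal is all zeros.
has_no_self_loops = bool(np.all(np.diag(A) == 0))

# Row i sums to the degree of node i.
degrees_from_A = A.sum(axis=1)
degrees_from_G = np.array([G.degree(n) for n in G.nodes()])
rows_match_degrees = np.array_equal(degrees_from_A, degrees_from_G)

print("Symmetric:", is_symmetric)
print("No self-loops:", has_no_self_loops)
print("Row sums match degrees:", rows_match_degrees)
print("Sum of all entries:", int(A.sum()))
```

The sum of all entries is 156, twice the number of edges, because each undirected edge appears once at $A[i, j]$ and once at $A[j, i]$.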

What about the node feature matrix $X$? For this particular dataset, explicit node features are not provided. In such cases, a common strategy is to create features from the graph's structure itself. For example, we could use each node's degree (the number of connections it has) as a simple feature. Another approach is to use an identity matrix, which assigns each node a unique one-hot encoded vector.

For the purpose of this example, let's create a simple feature matrix where each node's feature is just its degree.

# Create a feature matrix where each feature is the node's degree
degrees = [G.degree(n) for n in G.nodes()]
X = np.array(degrees).reshape(-1, 1)

print("Feature Matrix Shape:", X.shape)
print("First 5 features (node degrees):")
print(X[:5])

Output:

Feature Matrix Shape: (34, 1)
First 5 features (node degrees):
[[16]
 [ 9]
 [10]
 [ 6]
 [ 3]]
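The identity-matrix alternative mentioned earlier can be sketched in the same way. The names `X_onehot` and `X_combined` are hypothetical, and stacking both feature types side by side is only one possible design, shown here for illustration.

```python
import networkx as nx
import numpy as np

G = nx.karate_club_graph()
n = G.number_of_nodes()

# One-hot features: row i is all zeros except for a 1 in column i.
X_onehot = np.eye(n)

# Degree features, as a column vector of floats.
degree_column = np.array(
    [G.degree(v) for v in G.nodes()], dtype=float
).reshape(-1, 1)

# Optionally place both feature types side by side.
X_combined = np.hstack([X_onehot, degree_column])

print("One-hot feature matrix shape:", X_onehot.shape)
print("Combined feature matrix shape:", X_combined.shape)
```

One-hot features let a model tell every node apart but grow with the number of nodes, while the degree feature stays one column wide no matter how large the graph is.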

We now have our graph represented by two matrices: $A$, which describes the structure, and $X$, which describes the properties of the nodes. These are the fundamental inputs required by most GNN models. With this foundation, we are ready to explore how a GNN uses these matrices to learn from graph data.
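As a small preview of how the two matrices interact, the product $AX$ sums each node's neighbor features. This is only an illustrative sketch of neighbor aggregation under the degree features used above, not a full GNN layer; the name `H` is an assumption made for this example.

```python
import networkx as nx
import numpy as np

G = nx.karate_club_graph()

# Binary adjacency matrix and degree-based feature matrix.
A = nx.to_numpy_array(G, weight=None)
X = np.array([G.degree(n) for n in G.nodes()], dtype=float).reshape(-1, 1)

# Row i of A @ X adds up the feature vectors of node i's neighbors.
H = A @ X

# Cross-check node 0 against an explicit loop over its neighbors.
neighbor_degree_sum = sum(G.degree(v) for v in G.neighbors(0))

print("Aggregated feature shape:", H.shape)
print("Node 0, via matrix product:", H[0, 0])
print("Node 0, via explicit loop:", neighbor_degree_sum)
```

Both values agree, showing that a single matrix product performs the same neighbor-by-neighbor summation for every node at once.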
