When you connect multiple GPUs in a server, their ability to communicate efficiently becomes as important as their individual processing power. This communication channel, or interconnect, is the physical pathway data travels between GPUs and between GPUs and the CPU. The total time for a training job is the sum of computation and communication times: $T_{\text{total}} = T_{\text{compute}} + T_{\text{communication}}$. A slow interconnect can create a severe bottleneck, leaving your expensive GPUs waiting for data and dramatically increasing $T_{\text{communication}}$. Two primary interconnect technologies are encountered when building an on-premise system: the standard PCI Express bus and NVIDIA's specialized NVLink.
Peripheral Component Interconnect Express, or PCIe, is the standard high-speed interface used to connect processors to peripherals like GPUs, high-speed network cards, and NVMe drives on a motherboard. Every modern server and desktop computer uses it. When you install a GPU into a slot on a motherboard, you are connecting it to the system's PCIe bus.
For a single-GPU setup, PCIe is perfectly adequate. The GPU communicates with the CPU to get data and instructions, performs its parallel computations, and sends results back. The bottleneck, if any, is usually elsewhere.
The challenge arises in multi-GPU configurations. In a typical server motherboard, GPUs communicating over the PCIe bus must route their traffic through the CPU. If GPU 1 needs to send an update to GPU 2, the data path often looks like this: GPU 1 -> PCIe Bus -> CPU -> PCIe Bus -> GPU 2. This indirect path introduces significant latency and consumes the limited PCIe bandwidth connected to the CPU. While technologies like PCIe Peer-to-Peer (P2P) can enable more direct transfers, performance can be inconsistent and depends heavily on the motherboard's topology.
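Whether direct peer-to-peer transfers are actually available on a given machine can be checked at runtime. Below is a minimal sketch, assuming a PyTorch installation with CUDA support and at least two visible GPUs; the helper name `report_p2p_support` is arbitrary. A `False` result for a pair means transfers between those two devices will be staged through CPU memory.

```python
import torch

def report_p2p_support() -> None:
    """Ask the CUDA driver whether each GPU pair supports peer-to-peer access."""
    n = torch.cuda.device_count()
    if n < 2:
        print("Fewer than two GPUs visible; nothing to check.")
        return
    for src in range(n):
        for dst in range(n):
            if src == dst:
                continue
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: "
                  f"{'direct P2P possible' if ok else 'routed through the CPU'}")

if __name__ == "__main__":
    report_p2p_support()
```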
The bandwidth of a PCIe connection is determined by its version (e.g., 3.0, 4.0, 5.0) and the number of lanes it uses (e.g., x8, x16). For example, a PCIe 4.0 x16 slot offers a theoretical bidirectional bandwidth of 64 GB/s, while PCIe 5.0 doubles that to 128 GB/s. While impressive, these figures are per slot; the CPU exposes only a limited pool of PCIe lanes, and inter-GPU traffic that routes through it can saturate that pool during intense multi-GPU synchronization.
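These headline numbers follow directly from the per-lane signaling rate and the lane count. A quick back-of-envelope calculation reproduces them; the small gap between the headline and effective figures comes from the 128b/130b line coding that PCIe 3.0 and later use.

```python
# Back-of-envelope PCIe bandwidth: per-lane rate (GT/s) x lanes, per direction,
# doubled for the bidirectional figure. PCIe 3.0, 4.0, and 5.0 use 128b/130b coding.
PCIE_RATE_GTPS = {"3.0": 8.0, "4.0": 16.0, "5.0": 32.0}  # transfer rate per lane
ENCODING_EFFICIENCY = 128 / 130                           # 128b/130b line coding

def pcie_bandwidth_gbps(gen: str, lanes: int = 16) -> tuple[float, float]:
    """Return (headline, effective) bidirectional bandwidth in GB/s."""
    raw = PCIE_RATE_GTPS[gen] * lanes / 8      # GB/s per direction, before coding
    return 2 * raw, 2 * raw * ENCODING_EFFICIENCY

for gen in PCIE_RATE_GTPS:
    headline, effective = pcie_bandwidth_gbps(gen)
    print(f"PCIe {gen} x16: {headline:.0f} GB/s headline, "
          f"~{effective:.0f} GB/s after encoding overhead")
```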
To address the limitations of PCIe for multi-GPU workloads, NVIDIA developed NVLink, a proprietary high-speed interconnect that provides a direct communication link between GPUs. Instead of routing traffic through the CPU, NVLink-enabled GPUs can exchange data directly, as if on a private highway. This dramatically reduces latency and provides a massive increase in bandwidth dedicated solely to inter-GPU communication.
The diagram below illustrates the difference in data paths between a standard PCIe-based system and an NVLink-bridged system.
In a PCIe-only setup (left), communication between GPUs is arbitrated by the CPU. With an NVLink bridge (right), GPUs gain a direct, high-bandwidth connection, freeing the PCIe bus for communication with the CPU and storage.
This direct path is a game-changer for distributed training techniques like model parallelism, where different layers of a large model reside on separate GPUs and must constantly exchange intermediate activations. It is also highly effective for data parallelism, where gradients must be averaged across all GPUs after each step. By minimizing $T_{\text{communication}}$, NVLink allows training to scale more efficiently across multiple GPUs.
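In data parallelism, the operation that stresses the interconnect is the all-reduce that averages gradients after every backward pass. The sketch below shows this step using PyTorch's `torch.distributed` package; it assumes the process group has already been initialized with the NCCL backend, which routes the traffic over NVLink when it is present and falls back to PCIe otherwise.

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """All-reduce each parameter's gradient across ranks, then average.

    Assumes dist.init_process_group("nccl") has already been called,
    with one process per GPU.
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this gradient tensor across all GPUs, then divide to average.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```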
The difference in performance between these two technologies is not subtle. NVLink bandwidth has increased with each generation, consistently outpacing the corresponding PCIe standards available at the time.
Comparison of theoretical bidirectional bandwidth for common PCIe and NVLink generations. Note that NVLink values (e.g., 900 GB/s for 4th Gen) are often for a fully connected set of GPUs in a high-end server node like an HGX system, using multiple links per GPU.
This chart highlights the order-of-magnitude difference. The 900 GB/s bandwidth of 4th generation NVLink (used with H100 GPUs) is over 7 times that of PCIe 5.0. This massive bandwidth is what enables the training of today's largest models in a reasonable amount of time.
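To see what this gap means in practice, consider a rough estimate of the time to move one full set of gradients between GPUs. The sketch below assumes a hypothetical 7-billion-parameter model with FP16 gradients and uses the theoretical peak bandwidths quoted above; real all-reduce algorithms move roughly twice this payload and never reach peak bandwidth, so treat the results as lower bounds.

```python
# Rough lower bound on gradient-transfer time: payload size / peak bandwidth.
# Illustrative assumption: 7B-parameter model with FP16 (2-byte) gradients.
params = 7e9
bytes_per_param = 2
payload_gb = params * bytes_per_param / 1e9      # ~14 GB of gradients

interconnects_gbps = {                           # theoretical bidirectional GB/s
    "PCIe 4.0 x16": 64,
    "PCIe 5.0 x16": 128,
    "NVLink 4th gen": 900,
}

for name, bandwidth in interconnects_gbps.items():
    print(f"{name:>15}: >= {payload_gb / bandwidth * 1000:.1f} ms per gradient exchange")
```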
Your choice of interconnect technology has direct consequences on your server build, budget, and performance.
Workload Dependence: If your primary workloads involve single-GPU jobs or distributed tasks with very infrequent communication, the additional expense of NVLink may not be justified. A system with multiple PCIe 4.0 or 5.0 slots can be a cost-effective solution for running many independent experiments in parallel.
Requirement for Scale: If your goal is to accelerate the training of a single, large model across multiple GPUs, NVLink is practically a necessity. The performance gains from reducing $T_{\text{communication}}$ will far outweigh the initial hardware cost by shortening training cycles and improving GPU utilization.
Hardware Compatibility: NVLink is a feature of high-end, data-center-class NVIDIA GPUs (like the A100 and H100 series). It is not available on most consumer-grade GeForce cards. Furthermore, to connect GPUs with an NVLink bridge, you need a motherboard that physically spaces its PCIe slots correctly to accommodate the rigid bridge. For systems with more than two GPUs, you will typically need a specialized server platform, like an NVIDIA-Certified System, designed specifically for high-density, NVLink-connected GPU configurations.
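Before committing to a platform, you can inspect how the GPUs in an existing machine are actually wired. This small sketch shells out to `nvidia-smi topo -m`, which prints a matrix showing whether each GPU pair is connected by NVLink (entries such as `NV1`, `NV4`), through a PCIe switch, or only via the CPU; it assumes the NVIDIA driver and `nvidia-smi` are installed on the host.

```python
import subprocess

def show_gpu_topology() -> None:
    """Print the GPU interconnect topology matrix reported by nvidia-smi.

    "NV#" entries indicate NVLink connections; "PHB"/"SYS" entries indicate
    traffic that must cross the CPU. Requires the NVIDIA driver on the host.
    """
    result = subprocess.run(
        ["nvidia-smi", "topo", "-m"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)

if __name__ == "__main__":
    show_gpu_topology()
```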
Ultimately, the decision between a PCIe-based design and an NVLink-based one is a trade-off. You must balance the upfront capital expenditure against the performance needs of your most demanding AI workloads. For organizations serious about large-scale model development, investing in NVLink infrastructure is an investment in faster iteration and a more capable research platform.