Your compute cluster is only as fast as its slowest component. While GPUs provide immense computational power, they are completely dependent on the network to receive data and to coordinate with other nodes during distributed tasks. Minimizing the communication component of the total training time is a significant part of infrastructure design.
$$T_{\text{total}} = T_{\text{compute}} + T_{\text{communication}}$$

An under-provisioned network effectively throttles your expensive GPUs, leaving them idle while they wait for data. In this section, we'll examine the networking components and architectures required to feed the beast, ensuring your $T_{\text{communication}}$ term is as small as possible.
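To see why this matters, here is a small back-of-the-envelope sketch estimating how much of each training step the GPUs actually spend computing at different link speeds. All numbers are illustrative assumptions, and the model ignores any overlap of compute with communication.

```python
# Illustrative only: how communication time eats into GPU utilization.
# Assumes no overlap between compute and communication.

def step_time(t_compute_s: float, bytes_exchanged: float, link_gbps: float) -> tuple[float, float]:
    """Return (total step time in seconds, fraction of the step spent computing)."""
    t_comm = (bytes_exchanged * 8) / (link_gbps * 1e9)  # seconds to move the data
    t_total = t_compute_s + t_comm                      # T_total = T_compute + T_communication
    return t_total, t_compute_s / t_total

# Assume 150 ms of GPU compute per step and 2 GB of gradients exchanged per step.
for gbps in (10, 25, 100, 400):
    total, util = step_time(0.150, 2e9, gbps)
    print(f"{gbps:>4} Gbps link: step = {total * 1000:6.1f} ms, GPU busy {util:5.1%}")
```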
When planning your on-premise network, two metrics are of primary importance: bandwidth and latency.
Bandwidth is the data throughput capacity of the network, typically measured in gigabits per second (Gbps). High bandwidth is essential for operations that move large volumes of data, such as loading massive datasets from a storage server, checkpointing a large model, or transferring the model weights between nodes in certain parallelism strategies. A standard 1 Gbps office network is wholly insufficient; modern AI clusters typically start at 25 Gbps and frequently use 100 Gbps or faster connections.
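For a rough sense of scale, the sketch below estimates how long a single 140 GB checkpoint (roughly a 70B-parameter model in 16-bit precision) takes to move at different link speeds. The payload size and speeds are assumptions for illustration and ignore protocol overhead.

```python
# Rough transfer-time estimates; payload size and link speeds are assumed.
checkpoint_gb = 140  # e.g. ~70B parameters at 2 bytes each

for gbps in (1, 10, 25, 100):
    seconds = (checkpoint_gb * 8) / gbps  # GB -> Gb, then divide by Gb/s
    print(f"{gbps:>4} Gbps: {seconds / 60:5.1f} minutes per checkpoint transfer")
```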
Latency is the time delay for a packet of data to travel from its source to its destination, measured in milliseconds (ms) or microseconds (µs). Low latency is extremely important for synchronous distributed training, where multiple workers must frequently exchange small packets of information (like gradients) and wait for each other to complete before proceeding to the next step. High latency in this scenario creates a significant bottleneck, as all nodes are forced to wait for the slowest communication link to complete.
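The sketch below illustrates how a fixed per-message latency accumulates over the many small synchronous exchanges in each step. The message counts and latencies are assumed values, chosen only to show the order of magnitude.

```python
# Fixed per-message latency accumulates across the many small, synchronous
# exchanges in a training step. All numbers here are illustrative assumptions.
exchanges_per_step = 200   # small gradient/coordination messages per step
steps_per_epoch = 5_000

for name, latency_us in (("high-latency fabric", 100), ("low-latency fabric", 5)):
    overhead_s = exchanges_per_step * latency_us * 1e-6 * steps_per_epoch
    print(f"{name}: ~{overhead_s / 60:.1f} minutes of pure waiting per epoch")
```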
While Ethernet is the most common networking technology, high-performance computing (HPC) and AI clusters often use specialized technologies like InfiniBand, which is designed from the ground up for the highest bandwidth and lowest possible latency.
In a standard network stack using TCP/IP, sending data from an application on Server A to an application on Server B involves multiple steps. The data is copied from the application's memory space to the operating system's kernel space, processed by the TCP/IP stack, and then sent to the network card. The process is reversed on the receiving end. These copies and kernel-level interventions add significant latency and consume valuable CPU cycles.
Remote Direct Memory Access (RDMA) is a technology that changes this process entirely. It allows the network interface card (NIC) of one server to access the main memory of another server directly, without involving either server's operating system or CPU. This bypasses the TCP/IP stack and eliminates memory copies, drastically reducing latency and freeing up the CPU to focus on computation.
For distributed training workloads that require frequent, rapid communication, RDMA is not just a "nice-to-have" feature; it is a fundamental requirement for achieving high performance. RDMA is a native feature of InfiniBand and is also available over Ethernet through a protocol called RoCE (RDMA over Converged Ethernet).
The RDMA path bypasses kernel-level data copies and context switches, resulting in lower latency and reduced CPU overhead compared to the standard TCP/IP communication path.
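In practice, most training stacks get RDMA through their collective-communication library rather than through hand-written verbs code. The sketch below shows one way this is commonly configured with PyTorch's NCCL backend on an InfiniBand or RoCE fabric; the device and interface names are placeholders, and the variables worth setting depend on your NCCL version and hardware.

```python
# Minimal sketch: steering PyTorch's NCCL backend toward an RDMA-capable fabric.
# The HCA and interface names below are placeholders for your own hardware.
import os
import torch.distributed as dist

# NCCL uses InfiniBand/RoCE transports automatically when available;
# these variables make the choice explicit and aid debugging.
os.environ.setdefault("NCCL_IB_HCA", "mlx5_0")       # which RDMA NIC to use (placeholder)
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # interface for bootstrap traffic (placeholder)
os.environ.setdefault("NCCL_DEBUG", "INFO")          # log which transport (IB/RoCE vs sockets) was chosen

# Rank, world size, and the rendezvous address are expected to come from the
# launcher (e.g. torchrun) via the standard environment variables.
dist.init_process_group(backend="nccl")
```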
The way you physically connect your servers and switches, known as the network topology, has a direct impact on performance and scalability.
For a small setup with just two to four servers, a simple star topology is often sufficient. In this design, all servers connect directly to a single, high-performance switch. This is straightforward to implement and manage, but the central switch can become a performance bottleneck and represents a single point of failure as the cluster grows.
For larger, multi-rack clusters, a leaf-spine topology is the industry standard. This design consists of two layers of switches: leaf switches, which sit at the edge of the network and connect directly to the servers in each rack, and spine switches, which form the backbone, with every leaf switch connected to every spine switch.
This architecture provides multiple communication paths between any two servers in the cluster. Traffic between any two nodes crosses at most three switches: its source leaf, one spine, and its destination leaf, which keeps latency low and predictable. The aggregate bandwidth of a leaf-spine network scales linearly as you add more spine switches, making it an excellent choice for building large, high-performance AI factories.
A simple star topology connects all nodes to one switch, while a scalable leaf-spine topology uses two layers of switches to provide high, predictable bandwidth between all nodes in a larger cluster.
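A useful back-of-the-envelope check when sizing a leaf-spine fabric is each leaf's oversubscription ratio: server-facing bandwidth versus uplink bandwidth to the spines. The sketch below computes it for an assumed port layout; the counts and speeds are illustrative, not a recommendation.

```python
# Leaf oversubscription = downlink (server-facing) bandwidth / uplink (spine-facing) bandwidth.
# A ratio of 1:1 is non-blocking; AI training fabrics usually aim for 1:1 or close to it.
# All port counts and speeds below are assumed for illustration.

servers_per_leaf = 16
server_link_gbps = 100     # each server connects to its leaf at 100 Gbps
spine_switches = 4
uplink_gbps = 400          # one uplink from the leaf to each spine switch

downlink = servers_per_leaf * server_link_gbps   # 1600 Gbps toward servers
uplink = spine_switches * uplink_gbps            # 1600 Gbps toward spines

print(f"Downlink: {downlink} Gbps, uplink: {uplink} Gbps, "
      f"oversubscription {downlink / uplink:.2f}:1")
```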
When creating the specification for your on-premise AI server, the network is a first-class citizen alongside the CPU and GPU. Your planning should cover the bandwidth your workloads demand (25 Gbps as a practical floor, with 100 Gbps or more for multi-node training), NICs and switches that support RDMA via InfiniBand or RoCE, and a topology, such as leaf-spine, that can grow with the cluster.
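If you capture this specification in code alongside the rest of your infrastructure configuration, such checks are easy to automate. The sketch below is a hypothetical structure with a few simple sanity checks drawn from this section, not a standard format.

```python
# Hypothetical network spec for an on-premise AI cluster, with basic sanity checks.
from dataclasses import dataclass

@dataclass
class NetworkSpec:
    link_gbps: int      # per-server link speed
    rdma: str           # "infiniband", "roce", or "none"
    topology: str       # "star" or "leaf-spine"
    num_servers: int

    def warnings(self) -> list[str]:
        notes = []
        if self.link_gbps < 25:
            notes.append("Links below 25 Gbps will bottleneck multi-node training.")
        if self.rdma == "none" and self.num_servers > 1:
            notes.append("No RDMA (InfiniBand/RoCE): expect higher latency and CPU overhead.")
        if self.topology == "star" and self.num_servers > 4:
            notes.append("A single-switch star topology will not scale cleanly past a few servers.")
        return notes

print(NetworkSpec(link_gbps=100, rdma="roce", topology="leaf-spine", num_servers=8).warnings())
```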
By carefully planning your network, you ensure that communication overhead does not become the limiting factor in your system's performance, allowing your computational hardware to operate at its full potential.