As highlighted in the introduction to this chapter, communication costs, particularly the uplink from numerous clients to a central server, represent a significant performance bottleneck in many federated learning scenarios. While techniques like gradient compression directly tackle the size of the messages, another complementary approach focuses on reducing the frequency of communication rounds. This is achieved by optimizing the amount of computation performed locally on each client device before synchronization.
In the standard Federated Averaging (FedAvg) algorithm, two primary parameters govern this balance: the number of clients participating in each round (K), and the number of local training epochs (E) each selected client performs on its data before sending an update. This section examines the role of E and the associated trade-offs.
Increasing the number of local epochs (E) allows each client to perform more steps of local optimization (e.g., stochastic gradient descent) using its own dataset between communication rounds. Intuitively, if clients make more progress locally, fewer communication rounds might be needed to reach a target global model accuracy. Each communication round incurs overhead (network latency, server aggregation time), so reducing the total number of rounds can lead to faster overall training in terms of wall-clock time.
Consider the basic federated optimization objective:
$$\min_{w} F(w) = \sum_{i=1}^{N} p_i F_i(w)$$

where $w$ represents the global model parameters, $N$ is the total number of clients, $p_i$ is the weight for client $i$ (often proportional to its dataset size), and $F_i(w)$ is the local objective function for client $i$.
When a client $i$ performs E local epochs, it starts with the current global model $w_t$ and iteratively updates it using its local data, producing a local model $w_{t+1}^i$. The server then aggregates these local models (e.g., by weighted averaging) to produce the next global model $w_{t+1}$.
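The local-update-then-aggregate cycle can be sketched with a toy setup. This is a minimal illustration, not a production implementation: each client's data is reduced to a quadratic local objective $F_i(w) = \tfrac{1}{2}\|w - c_i\|^2$ with a client-specific optimum $c_i$ (an assumption made here purely for demonstration), and the server aggregates by dataset-size-weighted averaging.

```python
import numpy as np

def local_update(w_global, c_i, E, lr=0.1):
    # E local epochs of gradient descent on the toy quadratic objective
    # F_i(w) = 0.5 * ||w - c_i||^2; c_i stands in for client i's data.
    w = w_global.copy()
    for _ in range(E):
        w -= lr * (w - c_i)
    return w

def fedavg_round(w_global, client_optima, client_sizes, E):
    # One synchronous round: local training on every client, then
    # aggregation by dataset-size-weighted averaging (the p_i weights).
    p = np.asarray(client_sizes, dtype=float)
    p /= p.sum()
    local_models = [local_update(w_global, c, E) for c in client_optima]
    return sum(p_i * w_i for p_i, w_i in zip(p, local_models))

# Two clients with different local optima (Non-IID in miniature)
w = np.zeros(2)
optima = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
sizes = [100, 100]
for t in range(50):
    w = fedavg_round(w, optima, sizes, E=5)
# w approaches [0.5, 0.5], the weighted average of the local optima
```

Even in this tiny example the global model settles at a compromise between the two clients' optima, which is exactly what the weighted average of the local objectives prescribes.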
If E=1, clients perform only one pass (or potentially just one batch update) over their local data. This closely resembles a large-batch synchronous SGD approach distributed across clients, often referred to as Federated SGD (FedSGD). Communication happens frequently.
If E>1, clients perform multiple updates locally. This reduces communication frequency but introduces a potential challenge: client drift. As clients optimize intensely on their local, potentially unique (Non-IID) data distributions, their local models $w_{t+1}^i$ might diverge significantly from each other and from the optimal solution for the global objective $F(w)$. Aggregating these diverged models can slow down convergence or even prevent the global model from reaching a good solution.
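Client drift can be made concrete with the same kind of toy quadratic objectives (again an illustrative assumption, not a real training setup): when two clients have different local optima, the distance between their local models after E epochs grows with E.

```python
import numpy as np

def local_model(w0, c, E, lr=0.1):
    # E epochs of gradient descent on a toy local objective
    # F_i(w) = 0.5 * ||w - c||^2, whose minimizer is the client-specific c.
    w = w0.copy()
    for _ in range(E):
        w -= lr * (w - c)
    return w

w0 = np.zeros(2)
c1 = np.array([1.0, 0.0])   # client 1's local optimum (Non-IID)
c2 = np.array([0.0, 1.0])   # client 2's local optimum
for E in (1, 5, 20):
    drift = np.linalg.norm(local_model(w0, c1, E) - local_model(w0, c2, E))
    print(f"E={E:2d}  distance between local models = {drift:.3f}")
```

Each extra local epoch pulls every client's model further toward its own optimum and away from the others, which is the divergence the averaging step then has to absorb.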
The effectiveness of increasing E is strongly tied to the degree of statistical heterogeneity (Non-IID data) across clients, a topic discussed extensively in Chapter 4.
Beyond statistical heterogeneity, system heterogeneity (variations in client hardware, network connectivity, and availability) also influences the optimal choice of E. In synchronous FL, the server waits for all K selected clients to complete their E local epochs before proceeding. If E is large, clients with slower processors or limited power will take longer, delaying the entire round. These "stragglers" can severely limit the effective training speed.
Choosing a smaller E reduces the computation time per round, mitigating the impact of stragglers in synchronous settings. Asynchronous protocols, discussed later in this chapter, offer alternative ways to handle stragglers without being strictly bottlenecked by the slowest device, potentially making larger E values more viable in heterogeneous systems.
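The straggler effect can be quantified with a simplified timing model (the per-epoch times and fixed communication overhead below are assumptions for illustration): in a synchronous round, wall-clock time is set by the slowest selected client, and that penalty scales with E.

```python
# Simplified timing model: client i needs secs_per_epoch[i] seconds per
# local epoch, and every round pays a fixed communication overhead.
def sync_round_time(secs_per_epoch, E, comm_overhead=2.0):
    # Synchronous FL: the server waits for the slowest client (straggler).
    return max(t * E for t in secs_per_epoch) + comm_overhead

clients = [1.0, 1.2, 1.1, 8.0]   # seconds per epoch; one slow device
for E in (1, 5, 20):
    print(f"E={E:2d}  round time = {sync_round_time(clients, E):.1f}s")
```

Note how the single slow client dominates regardless of how fast the other three are, and how its contribution grows linearly in E while the communication overhead stays fixed.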
Finding the "best" value for E is often an empirical process specific to the application, model architecture, dataset characteristics, and system constraints. There is typically a sweet spot: too few local epochs waste wall-clock time on communication overhead, while too many induce client drift and amplify straggler delays.
A common practical approach is to start with a small E (e.g., 1 to 5) and gradually increase it while monitoring convergence, tracking metrics such as global model accuracy as a function of both communication rounds and wall-clock time.
The goal is usually to minimize wall-clock time to reach a desired accuracy level.
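This tuning objective can be sketched with a toy convergence model (the contraction factor, per-epoch time, and communication overhead below are all illustrative assumptions): estimate how many rounds each E needs to hit a target error, then multiply by the per-round cost.

```python
def rounds_to_target(E, tol=1e-3):
    # Toy convergence model: each FedAvg round shrinks the distance to the
    # global optimum by a factor rho(E) = (1 - lr)^E, with local lr = 0.1.
    rho = 0.9 ** E
    err, rounds = 1.0, 0
    while err > tol:
        err *= rho
        rounds += 1
    return rounds

def wall_clock_time(E, epoch_time=0.5, comm_time=2.0):
    # Per-round cost: E epochs of local computation plus a fixed
    # communication overhead; total = rounds needed * cost per round.
    return rounds_to_target(E) * (E * epoch_time + comm_time)

for E in (1, 2, 5, 10, 20):
    print(f"E={E:2d}  rounds={rounds_to_target(E):3d}  "
          f"time={wall_clock_time(E):6.1f}s")
```

In this drift-free toy model, larger E always wins because it amortizes the fixed communication cost over more local progress; in practice, client drift degrades the per-round progress at large E, which is what produces the sweet spot described above.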
In a hypothetical convergence scenario, increasing the number of local epochs (E) initially speeds up convergence in terms of wall-clock time (e.g., E=5 and E=10 reach a given accuracy faster than E=1). However, an excessively large E (e.g., E=20) might lead to slower convergence or instability due to client drift, especially on heterogeneous data. The optimal E balances computation and communication.
Also, consider the interaction between E and the local learning rate. Performing more local updates (E>1) might require using a smaller local learning rate compared to the E=1 case to maintain stability.
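One common heuristic for this interaction (a rule of thumb, not a universal prescription) is to scale the local learning rate down roughly in proportion to E, so the total amount of local progress per round stays approximately constant; the function and parameter names below are hypothetical.

```python
# Heuristic: shrink the local learning rate as E grows so that the total
# local optimization per round stays roughly constant, limiting drift.
def local_learning_rate(base_lr, E, E_ref=1):
    # base_lr is the rate tuned for E_ref local epochs (illustrative names)
    return base_lr * E_ref / E

print(local_learning_rate(0.1, 1))    # rate tuned for E=1
print(local_learning_rate(0.1, 10))   # scaled down for E=10
```

As with E itself, the right scaling is ultimately empirical; the inverse relationship is a starting point for tuning, not a guarantee of stability.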
In summary, optimizing local computation by tuning E is a significant lever for improving the efficiency of federated learning systems. It allows trading off local processing effort against communication frequency. However, this tuning must carefully account for the statistical properties of the data (Non-IID) and the systemic characteristics (hardware/network variability) of the participating clients to avoid issues like client drift and straggler bottlenecks.
© 2025 ApX Machine Learning