While Federated Averaging (FedAvg) provides a simple and intuitive baseline for federated learning, its effectiveness is often challenged in practical scenarios. Understanding its limitations is essential for appreciating why more sophisticated aggregation methods are needed. FedAvg's core idea involves clients performing local stochastic gradient descent (SGD) updates and a central server averaging the resulting model parameters. This simplicity, however, hides several assumptions that frequently break down in real-world deployments.
Let's examine the primary shortcomings:
Perhaps the most significant limitation of FedAvg stems from statistical heterogeneity, meaning the data distributions across clients are not independent and identically distributed (Non-IID). In many practical applications (like mobile keyboards predicting the next word for different users, or hospitals training models on their distinct patient populations), client data is inherently personalized and diverse.
Consider the global objective function that federated learning aims to minimize:
$$F(w) = \sum_{k=1}^{N} p_k F_k(w)$$

where $N$ is the number of clients, $p_k$ is the weight for client $k$ (often proportional to its dataset size), and $F_k(w)$ is the local objective function for client $k$ based on its local data $D_k$.
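As a quick illustration of this formula, here is a minimal sketch (assuming NumPy, with placeholder quadratic losses standing in for the real local objectives $F_k$) that computes dataset-size weights $p_k$ and evaluates the global objective at a given parameter vector:

```python
import numpy as np

# Hypothetical per-client dataset sizes; p_k = n_k / sum_j n_j.
client_sizes = np.array([1200, 300, 4500])
p = client_sizes / client_sizes.sum()

def local_objective(k, w):
    """Stand-in for F_k(w): the average loss of client k on its local data D_k."""
    rng = np.random.default_rng(k)        # fixed seed per client
    target = rng.normal(size=w.shape)     # stand-in for client k's local optimum
    return 0.5 * np.sum((w - target) ** 2)

w = np.zeros(10)                          # current global parameters
F = sum(p[k] * local_objective(k, w) for k in range(len(client_sizes)))
print("weights p_k:", p, "  global objective F(w):", F)
```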
When data is Non-IID, the local optima $w_k^* = \arg\min_w F_k(w)$ can be significantly different for each client $k$, and also different from the global optimum $w^* = \arg\min_w F(w)$.
During local training in FedAvg, each client's model $w_k$ moves towards its local optimum $w_k^*$. If clients perform multiple local gradient descent steps (as is common in FedAvg to reduce communication), their models can diverge substantially from each other and from the current global model $w^t$. This phenomenon is often called client drift.
When the server averages these drifted local models ($w^{t+1} = \sum_{k \in S_t} p_k w_k^{t+1}$, where $S_t$ is the set of clients selected in round $t$), the resulting global model $w^{t+1}$ might not be a good compromise. It could perform poorly across all clients or even diverge. Simple averaging implicitly assumes that the average of local optima is close to the global optimum, which fails under significant Non-IID conditions.
Figure: client models train towards their local optima, potentially far from the global optimum, due to Non-IID data; averaging these drifted models may result in a suboptimal global model.
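To make the drift-and-average mechanics concrete, the sketch below simulates a single FedAvg round using simple quadratic local objectives as stand-ins for real losses (the client count, learning rate, and step count are illustrative assumptions): each client drifts toward its own local optimum during its $E$ local steps, and the server then forms the weighted average $w^{t+1} = \sum_{k \in S_t} p_k w_k^{t+1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
num_clients, dim, E, lr = 5, 2, 20, 0.1

# Each client's local optimum w_k*; under Non-IID data these differ widely.
local_optima = rng.normal(scale=3.0, size=(num_clients, dim))
client_sizes = rng.integers(100, 1000, size=num_clients)
p = client_sizes / client_sizes.sum()        # aggregation weights p_k

w_global = np.zeros(dim)                     # current global model w^t

# One FedAvg round: local training on every selected client, then averaging.
client_models = []
for k in range(num_clients):
    w_k = w_global.copy()
    for _ in range(E):                       # E local gradient steps
        grad = w_k - local_optima[k]         # gradient of 0.5 * ||w - w_k*||^2
        w_k -= lr * grad
    client_models.append(w_k)                # model has drifted toward w_k*

w_next = sum(p[k] * client_models[k] for k in range(num_clients))
print("drifted client models:\n", np.array(client_models))
print("aggregated w^{t+1}:", w_next)
```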
This client drift due to Non-IID data leads to slower convergence, oscillation or instability of the global model across rounds, and a final model whose accuracy can fall well short of what centralized training on the pooled data would achieve.
Federated networks often consist of clients with vastly different capabilities: hardware (CPU, GPU, memory), network connectivity (bandwidth, latency, reliability), and availability (battery level, willingness to participate). This variability is commonly referred to as systems heterogeneity.
Standard synchronous FedAvg, where the server waits for updates from a selected cohort of clients before performing aggregation, is particularly vulnerable to systems heterogeneity. The overall progress is dictated by the slowest client in each round (the "straggler" problem). This can severely slow down training, especially in large-scale deployments with diverse devices.
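A quick simulation with hypothetical log-normally distributed per-client completion times illustrates how the slowest participant dominates each synchronous round:

```python
import numpy as np

rng = np.random.default_rng(1)
rounds, cohort_size = 100, 50

# Hypothetical per-client completion times (seconds) with a heavy right tail,
# reflecting heterogeneous hardware and network conditions.
times = rng.lognormal(mean=1.0, sigma=0.8, size=(rounds, cohort_size))

median_client = np.median(times)             # a typical client's time
sync_round = times.max(axis=1).mean()        # the server waits for the slowest
print(f"median client time: {median_client:.1f}s, "
      f"average synchronous round: {sync_round:.1f}s")
```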
While asynchronous FL variants exist, simple averaging in these settings also faces challenges. Averaging models or updates computed based on different versions of the global model (staleness) without proper handling can degrade convergence and stability. Furthermore, if clients perform different amounts of local work (e.g., variable number of local epochs due to computational constraints) before sending updates, naive averaging might improperly weight their contributions.
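The following sketch (same quadratic stand-in objectives as above, with illustrative step counts) shows how naive unweighted averaging over-weights a client that happened to run many more local steps, pulling the aggregate toward that client's local optimum; this is the imbalance that methods such as FedNova later correct.

```python
import numpy as np

dim, lr = 2, 0.1
local_optima = {                         # two clients with different local optima
    "fast": np.array([ 4.0, 0.0]),       # completed 50 local steps
    "slow": np.array([-4.0, 0.0]),       # managed only 2 local steps
}
steps = {"fast": 50, "slow": 2}

w_global = np.zeros(dim)
updated = {}
for name, opt in local_optima.items():
    w = w_global.copy()
    for _ in range(steps[name]):
        w -= lr * (w - opt)              # gradient step on 0.5 * ||w - opt||^2
    updated[name] = w

naive_avg = 0.5 * (updated["fast"] + updated["slow"])
print(updated)
print("naive average:", naive_avg)       # skewed toward the 'fast' client's optimum
```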
FedAvg treats all client updates equally during the averaging step. It assumes clients are honest and follow the protocol correctly. However, a malicious client (often called a Byzantine client) could intentionally send corrupted or poisoned updates designed to degrade the global model's performance or introduce backdoors.
Since FedAvg simply averages the parameters:
$$w^{t+1} = \frac{1}{|S_t|} \sum_{k \in S_t} w_k^{t+1} \quad \text{(unweighted version)}$$

a single client sending parameter values with extremely large magnitudes could completely dominate the average, effectively hijacking the global model. FedAvg lacks inherent robustness against such attacks. Even a small number of malicious participants can severely compromise the entire training process.
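A small numerical example makes this concrete (the update values here are purely illustrative): with nine honest clients and one attacker sending huge parameter values, the unweighted mean is dominated by the attacker.

```python
import numpy as np

rng = np.random.default_rng(2)
honest_updates = rng.normal(loc=0.5, scale=0.1, size=(9, 4))   # 9 honest clients
malicious_update = np.full((1, 4), 1e6)                        # 1 Byzantine client

all_updates = np.vstack([honest_updates, malicious_update])
print("mean of honest updates: ", honest_updates.mean(axis=0))
print("mean with one attacker:", all_updates.mean(axis=0))     # ~1e5 per coordinate
```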
While often cited as a benefit, communication efficiency in FedAvg still presents challenges. Transmitting full model updates (potentially millions of parameters for deep learning models) repeatedly can be costly, especially over constrained networks (like mobile connections). FedAvg aims to reduce the number of communication rounds compared to sending every gradient update, but the size of each communication payload remains large.
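For a sense of scale, a back-of-the-envelope estimate (hypothetical model size, 32-bit weights, and cohort size) shows how quickly full-model transmission adds up:

```python
num_params = 10_000_000        # e.g., a moderately sized deep network
bytes_per_param = 4            # float32 weights
clients_per_round = 100
rounds = 1000

per_client_mb = num_params * bytes_per_param / 1e6              # ~40 MB per upload
uplink_per_round_gb = per_client_mb * clients_per_round / 1e3   # ~4 GB to the server
total_uplink_tb = uplink_per_round_gb * rounds / 1e3            # ~4 TB over training
print(per_client_mb, uplink_per_round_gb, total_uplink_tb)
```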
Furthermore, the convergence rate of FedAvg can be substantially slower than centralized training, particularly when dealing with high variance in client updates (due to sampling or Non-IID data) or when the number of local steps (E) is not carefully tuned. The averaging process itself is an approximation, and the variance introduced by client sampling and local updates can impede smooth convergence towards the global minimum.
These limitations highlight that while FedAvg laid the groundwork, it often falls short in complex, real-world federated settings. The subsequent sections in this chapter introduce advanced aggregation algorithms specifically designed to overcome these challenges, leading to more robust, efficient, and accurate federated learning systems. FedProx tackles client drift from Non-IID data, SCAFFOLD addresses variance and drift, FedNova handles systems heterogeneity, and Byzantine-robust methods provide resilience against malicious actors.