While Federated Averaging (FedAvg) provides a simple mechanism for aggregating client contributions, its reliance on a standard mean calculation makes it highly susceptible to disruptions. In environments where some participants might be faulty or actively malicious (so-called Byzantine clients), FedAvg can be easily compromised. A single client submitting a carefully crafted, arbitrarily large, or simply nonsensical update can drastically shift the global model away from the optimal solution, potentially causing divergence or even introducing hidden backdoors. This vulnerability necessitates aggregation strategies designed specifically to withstand such adversarial behavior.
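To make this vulnerability concrete, here is a minimal NumPy sketch (all values are hypothetical) showing how a single extreme update drags the plain arithmetic mean far away from the honest consensus:

```python
import numpy as np

# Five honest clients send similar updates near [1.0, 1.0];
# one Byzantine client sends an arbitrarily large update.
honest = [np.array([0.9, 1.1]), np.array([1.0, 1.0]),
          np.array([1.1, 0.9]), np.array([1.0, 1.05]),
          np.array([0.95, 1.0])]
byzantine = [np.array([1000.0, -1000.0])]

updates = honest + byzantine
fedavg = np.mean(updates, axis=0)  # plain arithmetic mean, as in FedAvg

print(fedavg)  # dragged far from the honest consensus near [1.0, 1.0]
```

A single outlier is enough: the averaged update lands orders of magnitude away from every honest contribution.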
The core idea behind Byzantine-robust aggregation is borrowed from robust statistics. Instead of using the arithmetic mean, which is notoriously sensitive to outliers, these methods employ estimators that are less affected by extreme values. The goal is to identify and downweight or completely discard updates that appear anomalous compared to the majority, assuming that Byzantine clients constitute a minority and their updates deviate significantly from those submitted by honest clients.
Let's examine several prominent Byzantine-robust aggregation methods:
Perhaps the simplest robust alternative to the mean is the coordinate-wise median. Instead of averaging the update vectors component by component, we compute the median for each coordinate across all received client updates.
Suppose we receive $n$ update vectors $v_1, v_2, \ldots, v_n$, each of dimension $d$. The aggregated update $v_{\text{agg}}$ is computed such that its $j$-th component $v_{\text{agg},j}$ is the median of the $j$-th components of all received updates:

$$v_{\text{agg},j} = \operatorname{median}(v_{1,j}, v_{2,j}, \ldots, v_{n,j})$$

for each dimension $j = 1, \ldots, d$.
This approach is computationally efficient, significantly faster than more complex methods such as the geometric median. While its theoretical guarantees are sometimes weaker, it often performs well in practice and can tolerate a Byzantine fraction approaching 50% under certain assumptions. Its primary weakness is that it treats each dimension independently, so it can miss malicious updates whose perturbations are coordinated across multiple dimensions.
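A minimal sketch of coordinate-wise median aggregation in NumPy (the function name and example values are illustrative):

```python
import numpy as np

def coordinate_wise_median(updates):
    """Aggregate client updates by taking the median of each coordinate.

    updates: array of shape (n_clients, d).
    Returns the aggregated update of shape (d,).
    """
    return np.median(updates, axis=0)

# Example: one Byzantine outlier among five honest updates.
updates = np.array([
    [1.0, 1.0],
    [0.9, 1.1],
    [1.1, 0.9],
    [1.05, 0.95],
    [1000.0, -1000.0],  # Byzantine update
])
agg = coordinate_wise_median(updates)  # stays close to the honest cluster
```

Unlike the mean in the earlier example, the per-coordinate median remains inside the range of the honest values as long as honest clients form a majority for each coordinate.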
The Trimmed Mean offers another computationally reasonable approach. It works by discarding a fraction of the updates deemed most likely to be outliers before calculating the mean of the remaining ones.
Specifically, for each coordinate $j$, the server sorts the values $v_{1,j}, v_{2,j}, \ldots, v_{n,j}$. Assuming an upper bound $\beta$ on the fraction of expected Byzantine clients, the server removes the $\beta n$ smallest and $\beta n$ largest values for that coordinate. The $j$-th component of the aggregated update, $v_{\text{agg},j}$, is then the mean of the remaining $(1 - 2\beta)n$ values.

The effectiveness of Trimmed Mean hinges on correctly estimating $\beta$. If $\beta$ is set too low, the aggregation might still be vulnerable to Byzantine influence. If set too high, updates from honest clients might be unnecessarily discarded, potentially slowing down convergence or reducing model quality, especially in highly heterogeneous (Non-IID) settings where honest updates might naturally have large variations.
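The trimming procedure described above can be sketched as follows (NumPy; the helper name and the choice beta = 0.2 are illustrative):

```python
import numpy as np

def trimmed_mean(updates, beta):
    """Coordinate-wise trimmed mean: for each coordinate, drop the
    floor(beta * n) smallest and floor(beta * n) largest values,
    then average what remains.

    updates: array of shape (n_clients, d).
    beta: assumed upper bound on the Byzantine fraction.
    """
    n = updates.shape[0]
    k = int(beta * n)              # values trimmed from each end
    assert n > 2 * k, "beta too large: nothing would remain after trimming"
    sorted_updates = np.sort(updates, axis=0)  # sort each coordinate independently
    return sorted_updates[k:n - k].mean(axis=0)

updates = np.array([
    [1.0, 1.0],
    [0.9, 1.1],
    [1.1, 0.9],
    [1.05, 0.95],
    [1000.0, -1000.0],  # Byzantine update
])
agg = trimmed_mean(updates, beta=0.2)  # with n=5, trims one value per end
```

With beta = 0.2 and five clients, exactly one extreme value per end is removed in each coordinate, which is enough to neutralize the single Byzantine update here.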
Krum takes a different approach by selecting a single update deemed most "representative" or "central". It assumes that honest clients' updates will be relatively close to each other in the parameter space, while Byzantine updates will be outliers.
For each received update $v_i$, Krum calculates a score based on the sum of squared Euclidean distances to its $k$ nearest neighbors among the other updates. Let $N_k(i)$ be the set of indices of the $k$ updates closest to $v_i$ (excluding $v_i$ itself). The score for $v_i$ is:

$$\text{score}(i) = \sum_{l \in N_k(i)} \| v_i - v_l \|_2^2$$

The server then selects the single update $v_{i^*}$ with the minimum score as the aggregated result for the round:

$$i^* = \arg\min_i \, \text{score}(i), \qquad v_{\text{agg}} = v_{i^*}$$
The parameter $k$ must be chosen carefully based on the assumed maximum number $f$ of Byzantine clients, typically $k = n - f - 2$, ensuring that the neighborhood calculation for the best honest client excludes all Byzantine clients.
Krum guarantees robustness provided the number of Byzantine clients $f$ satisfies $2f + 2 < n$. However, selecting only one update can lead to slow convergence or instability. Multi-Krum addresses this by selecting the $m$ updates with the lowest Krum scores (where $m$ is typically small, e.g., $m = 5$ or $m = 10$) and averaging them. This often provides a better balance between robustness and stability.
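The scoring and selection steps above can be sketched in NumPy as follows (function names and example data are illustrative; f denotes the assumed number of Byzantine clients, and m = 1 recovers plain Krum):

```python
import numpy as np

def krum_scores(updates, f):
    """For each update, sum the squared Euclidean distances to its
    k = n - f - 2 nearest neighbors among the other updates."""
    n = len(updates)
    k = n - f - 2
    # Pairwise squared Euclidean distances between all updates.
    diffs = updates[:, None, :] - updates[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    scores = np.empty(n)
    for i in range(n):
        others = np.delete(dists[i], i)        # exclude distance to itself
        scores[i] = np.sort(others)[:k].sum()  # k closest neighbors
    return scores

def multi_krum(updates, f, m=1):
    """Average the m updates with the lowest Krum scores (m=1 is Krum)."""
    scores = krum_scores(updates, f)
    best = np.argsort(scores)[:m]
    return updates[best].mean(axis=0)

updates = np.array([
    [1.0, 1.0],
    [0.9, 1.1],
    [1.1, 0.9],
    [1.05, 0.95],
    [0.95, 1.0],
    [1000.0, -1000.0],  # Byzantine update: far from every neighbor
])
agg = multi_krum(updates, f=1)  # selects an update from the honest cluster
```

The Byzantine update's nearest neighbors are all far away, so its score is enormous and it is never selected; an honest update near the cluster center wins instead.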
Illustration of Krum. Honest updates (blue) cluster together, while Byzantine updates (red) are outliers. Krum identifies an update (e.g., the filled blue circle) whose nearest neighbors are close by, indicating it's likely honest.
Choosing a Byzantine-robust aggregator involves trade-offs among robustness guarantees, computational cost, convergence behavior, and sensitivity to data heterogeneity.
In practice, the choice depends heavily on the specific application, the anticipated threat level, the computational resources available at the server, and the degree of statistical heterogeneity across clients. If Byzantine attacks are a significant concern, the computational overhead of methods like Coordinate-wise Median, Trimmed Mean, or Multi-Krum is often a worthwhile investment compared to the potential failure of standard FedAvg. These advanced aggregation rules are essential tools for building dependable federated learning systems in potentially adversarial environments.
© 2025 ApX Machine Learning