As outlined in the chapter introduction, the sheer volume of data transferred between clients and the server, particularly gradient information sent from clients, often limits the scalability and efficiency of federated learning systems. Gradient compression techniques directly attack this problem by reducing the number of bits required to represent each client's gradient update before transmission.
The core idea is simple: instead of sending the full-precision, often dense, gradient vector computed locally (call it g), the client computes a compressed version $g_{\text{compressed}} = C(g)$, sends $g_{\text{compressed}}$, and the server then uses an aggregation strategy suited to these compressed updates, perhaps after decompressing them with $D(g_{\text{compressed}})$. The compression function C and decompression function D are designed so that $g_{\text{compressed}}$ is significantly smaller than g.
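As a concrete, deliberately trivial illustration of this round trip, the sketch below uses a plain half-precision cast as a stand-in for C and D. The function names are placeholders, and real schemes use the quantization or sparsification methods described next.

```python
import numpy as np

def compress(g: np.ndarray) -> np.ndarray:
    """Stand-in for C(g): cast the 32-bit gradient to 16-bit floats."""
    return g.astype(np.float16)

def decompress(g_compressed: np.ndarray) -> np.ndarray:
    """Stand-in for D(g_compressed): restore full precision before aggregation."""
    return g_compressed.astype(np.float32)

# Client side: compute the local gradient, transmit only the compressed form.
g = np.random.randn(1_000_000).astype(np.float32)
payload = compress(g)

# Server side: decompress each client's payload, then aggregate as usual.
g_restored = decompress(payload)
print(payload.nbytes / g.nbytes)  # 0.5 -> half the bytes on the wire
```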
Two primary strategies dominate this space: quantization and sparsification.
Quantization reduces the number of bits used to represent each numerical value within the gradient vector. Instead of using standard 32-bit or 64-bit floating-point numbers for every component of the gradient, quantization maps these values to a smaller set of discrete levels, representable with fewer bits.
Types of Quantization:
Quantization schemes broadly fall into deterministic approaches, which round each component to its nearest representable level, and stochastic approaches, which round randomly so that the quantized value is correct in expectation.
Mechanisms:
A common approach is stochastic quantization. For a gradient component $g_i$, instead of transmitting its precise value, we might transmit a quantized value $q_i$ chosen randomly such that its expected value equals the original, $\mathbb{E}[q_i] = g_i$. This introduces noise but keeps the process unbiased on average.
For example, consider quantizing a value $x \in [0, 1]$ to either 0 or 1. Stochastic quantization would output 1 with probability $x$ and 0 with probability $1 - x$.
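A short NumPy sketch of this binary case (the function name is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round_01(x: np.ndarray) -> np.ndarray:
    """Quantize values in [0, 1] to {0, 1}: output 1 with probability x, else 0."""
    return (rng.random(x.shape) < x).astype(np.float32)

x = np.array([0.1, 0.5, 0.9])
samples = np.stack([stochastic_round_01(x) for _ in range(10_000)])
print(samples.mean(axis=0))  # close to [0.1, 0.5, 0.9]: unbiased in expectation
```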
A well-known algorithm employing quantization is Quantized SGD (QSGD). QSGD typically involves:
1. Normalizing the gradient, for example by its norm, so that component magnitudes fall into a fixed range.
2. Stochastically quantizing each normalized component onto a small set of discrete levels (the number of levels controls the bit budget).
3. Encoding the resulting signs, levels, and the norm compactly before transmission, often with a lossless code.
The server receives these quantized gradients, decodes them, and aggregates them.
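The sketch below follows this outline in simplified form. It is not the exact QSGD encoder: the L2 normalization, the choice of s = 4 levels, and the omission of the final lossless coding step are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def qsgd_quantize(g: np.ndarray, s: int = 4):
    """Quantize |g_i| / ||g||_2 onto the levels {0, 1/s, ..., 1} with stochastic rounding."""
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return 0.0, np.zeros(g.shape, np.int8), np.zeros(g.shape, np.int8)
    scaled = np.abs(g) / norm * s                             # in [0, s]
    lower = np.floor(scaled)
    levels = lower + (rng.random(g.shape) < scaled - lower)   # round up with prob (scaled - lower)
    # Payload per component: a sign and a small integer level, plus one scalar norm overall.
    return norm, np.sign(g).astype(np.int8), levels.astype(np.int8)

def qsgd_dequantize(norm, signs, levels, s: int = 4):
    """Server-side decode: an unbiased estimate of the original gradient."""
    return norm * signs * levels.astype(np.float64) / s

g = np.array([0.1, -2.5, 0.8, -0.3, 1.9])
norm, signs, levels = qsgd_quantize(g, s=4)
print(qsgd_dequantize(norm, signs, levels, s=4))  # noisy, but equals g in expectation
```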
Trade-offs: Quantization directly reduces the payload size. For instance, moving from 32-bit floats to 8-bit integers achieves a 4x reduction. However, this introduces quantization error (noise). While stochastic methods aim to be unbiased, the added variance can slow down convergence or slightly reduce the final model accuracy. There's also a computational cost for performing the quantization on the client and dequantization/aggregation on the server.
Communication cost decreases significantly as fewer bits are used, but quantization error grows, which can affect model convergence and final accuracy.
Sparsification takes a different approach. Instead of sending lower-precision versions of all gradient components, it sends only a small subset of the components, typically those deemed most significant, at full precision. The remaining components are effectively treated as zero for that communication round.
Mechanisms:
The most common technique is Top-k sparsification:
1. The client computes its local gradient g as usual.
2. It identifies the k components with the largest absolute values.
3. It transmits only those k values together with their indices; all other components are treated as zero for this round.
The value of k is a hyperparameter, often expressed as a fraction of the total number of parameters (for example, sending the top 1% or top 10% of gradient components).
Example:
If $g = [0.1, -2.5, 0.8, -0.3, 1.9]$ and we use Top-2 sparsification:
The largest magnitudes are $|-2.5| = 2.5$ and $|1.9| = 1.9$.
The sparse gradient sent would represent $[0, -2.5, 0, 0, 1.9]$. This could be encoded as the (index, value) pairs [(1, -2.5), (4, 1.9)].
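A minimal NumPy sketch of this selection and the server-side reconstruction (the function name is illustrative); it reproduces the Top-2 example above with zero-based indices:

```python
import numpy as np

def top_k_sparsify(g: np.ndarray, k: int):
    """Keep the k components of g with the largest magnitude; return (indices, values)."""
    idx = np.sort(np.argsort(np.abs(g))[-k:])   # indices of the k largest |g_i|
    return idx, g[idx]

g = np.array([0.1, -2.5, 0.8, -0.3, 1.9])
indices, values = top_k_sparsify(g, k=2)
print(list(zip(indices.tolist(), values.tolist())))  # [(1, -2.5), (4, 1.9)]

# Server side: rebuild the dense (mostly zero) gradient for aggregation.
dense = np.zeros_like(g)
dense[indices] = values
print(dense)  # [ 0.  -2.5  0.   0.   1.9]
```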
Trade-offs: Sparsification can dramatically reduce communication size, especially if k is much smaller than the total dimension of the gradient. However, discarding the information from the smaller gradient components can significantly impact convergence. If different clients consistently have large gradients for different parameters, the aggregation might be less effective. The effectiveness also depends on the nature of the gradients; models with naturally sparse gradients benefit more. Sending indices along with values adds overhead, so the actual compression ratio depends on k and the encoding format. Like quantization, sparsification adds computational overhead for selecting the top k values.
Diagram: process flow for Top-k gradient sparsification.
It's also possible to combine quantization and sparsification. For instance, one could first select the Top-k gradients and then quantize only those values before transmission. This offers a potential path to even greater compression but requires careful tuning, as the errors and information loss from both techniques can compound.
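A compact sketch of one way to chain the two steps; the function names, the L2 normalization of the selected values, and the choice of s = 4 levels are all illustrative assumptions rather than a standard recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_then_quantize(g: np.ndarray, k: int, s: int = 4):
    """Select the top-k components by magnitude, then stochastically quantize only those."""
    idx = np.sort(np.argsort(np.abs(g))[-k:])
    vals = g[idx]
    norm = np.linalg.norm(vals)
    scaled = np.abs(vals) / norm * s
    lower = np.floor(scaled)
    levels = lower + (rng.random(vals.shape) < scaled - lower)
    # Payload: k indices, k signs, k small integer levels, and one scalar norm.
    return idx, np.sign(vals).astype(np.int8), levels.astype(np.int8), norm

def decode(idx, signs, levels, norm, dim, s: int = 4):
    """Server side: rebuild a dense gradient estimate from the combined payload."""
    dense = np.zeros(dim)
    dense[idx] = norm * signs * levels.astype(np.float64) / s
    return dense

g = np.random.randn(1_000)
payload = topk_then_quantize(g, k=10, s=4)
g_hat = decode(*payload, dim=g.shape[0], s=4)
```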
Both quantization and sparsification are active areas of research. Their effectiveness is highly dependent on the specific FL setup, including the machine learning model, the data distribution across clients (heterogeneity), the optimization algorithm used, and the chosen compression parameters (number of bits for quantization, value of k for sparsification). The next section discusses error compensation techniques, which are often used alongside compression to mitigate the negative impacts on model training.