Having established the federated optimization objective, $F(w) = \sum_{k=1}^{N} p_k F_k(w)$, where $p_k$ is client $k$'s aggregation weight (commonly its share $n_k/n$ of the total training data), a fundamental design choice is how to orchestrate the client training and server aggregation steps over time. This leads to two primary operational models: synchronous and asynchronous federated learning. The choice between them significantly impacts system performance, convergence behavior, and tolerance to real-world constraints such as varying client speeds and availability.
Synchronous Federated Learning
Synchronous FL operates in distinct, coordinated rounds. It is the model most commonly associated with the canonical Federated Averaging (FedAvg) algorithm. The process generally follows these steps (a code sketch of one round follows the list):
- Selection: The central server selects a subset of available clients $C_t$ to participate in the current training round $t$.
- Distribution: The server sends the current global model state $w_t$ to the selected clients.
- Local Training: Each selected client $k \in C_t$ performs local computation, typically multiple steps of gradient descent on its local data $D_k$, starting from $w_t$ to produce a local model update $\Delta w_k^{t+1}$ or a new local model $w_k^{t+1}$.
- Upload & Wait: Clients upload their computed updates to the server. The server waits until it receives updates from all selected clients (or a predefined minimum number or fraction).
- Aggregation: Once the required updates are received, the server aggregates them (e.g., using weighted averaging: $w_{t+1} = w_t + \sum_{k \in C_t} p_k \Delta w_k^{t+1}$ or $w_{t+1} = \sum_{k \in C_t} p_k w_k^{t+1}$) to produce the new global model $w_{t+1}$.
- Repeat: The process repeats from the selection step for the next round $t+1$.
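A minimal sketch of one such round, assuming model weights are NumPy arrays and a hypothetical helper `local_train(client_id, w)` that runs the client's local gradient steps and returns its updated model; the weights $p_k$ are taken proportional to local dataset sizes, as in FedAvg:

```python
import random
import numpy as np

def synchronous_round(global_w, clients, cohort_size, local_train):
    """One synchronous FedAvg round (a sketch, not a specific framework).

    clients: list of (client_id, num_examples) pairs.
    local_train: hypothetical helper running local SGD from the given
    weights and returning the client's updated weight vector.
    """
    # Selection: sample the cohort C_t from the available clients.
    cohort = random.sample(clients, cohort_size)

    # Distribution, local training, upload & wait: the server blocks
    # here until every selected client has returned its local model.
    local_models, sizes = [], []
    for client_id, n_k in cohort:
        local_models.append(local_train(client_id, global_w.copy()))
        sizes.append(n_k)

    # Aggregation: weighted average with p_k = n_k / (total cohort examples).
    p = np.asarray(sizes, dtype=float)
    p /= p.sum()
    return sum(p_k * w_k for p_k, w_k in zip(p, local_models))
```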
Figure: Synchronous FL operation. The server waits for all selected clients, including slower ones (stragglers), before performing aggregation and proceeding to the next round.
Advantages:
- Simplicity: The lock-step structure simplifies implementation and theoretical analysis, since aggregation uses updates derived from the same global model version $w_t$.
- Convergence Guarantees: Convergence properties are easier to analyze, building on standard distributed optimization theory. Many established FL algorithms (FedAvg, FedProx, SCAFFOLD) were initially proposed in a synchronous setting.
Disadvantages:
- Straggler Problem: The overall round time is dictated by the slowest client in the selected cohort. This leads to inefficient use of faster clients and can significantly slow down training, especially in environments with high systems heterogeneity (varying compute power, network bandwidth).
- Underutilization: Faster clients sit idle waiting for stragglers.
- Sensitivity to Dropouts: If a client drops out mid-round, the server might wait indefinitely unless mechanisms like timeouts or minimum participation thresholds are implemented (sketched below), which can themselves introduce bias.
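A common mitigation is to collect updates only until a deadline and proceed once a minimum quorum has reported. The sketch below uses Python's `concurrent.futures` for illustration; the function name and parameters are assumptions, not any framework's API:

```python
import time
from concurrent.futures import FIRST_COMPLETED, wait

def collect_updates(futures, min_count, timeout_s):
    """Gather client results until all arrive or the deadline passes.

    futures: one in-flight computation per selected client.
    Late stragglers are dropped, which bounds the round time but can
    bias the aggregate toward faster clients.
    """
    deadline = time.monotonic() + timeout_s
    done, pending = set(), set(futures)
    while pending and time.monotonic() < deadline:
        finished, pending = wait(pending,
                                 timeout=deadline - time.monotonic(),
                                 return_when=FIRST_COMPLETED)
        done |= finished
    if len(done) < min_count:
        raise RuntimeError("Round aborted: minimum quorum not reached")
    return [f.result() for f in done]
```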
Asynchronous Federated Learning
Asynchronous FL aims to mitigate the straggler problem by decoupling client computation and server aggregation. There isn't a single definition, but the core idea is that the server does not wait for a fixed cohort of clients in lock-step rounds.
A common approach works as follows (a server-side sketch follows the list):
- Client Ready: A client indicates its availability to the server.
- Distribution: The server sends its current global model state $w_{\text{server}}$ to the client. Note that $w_{\text{server}}$ might have been updated since other clients started their computations.
- Local Training: The client performs its local computation based on the received model and its local data, producing an update $\Delta w_k$.
- Upload: The client uploads its update $\Delta w_k$ to the server.
- Immediate Aggregation (or Buffering): Upon receiving an update $\Delta w_k$ (which was computed from a potentially older model version $w_{\text{client\_received}}$), the server immediately updates the global model, possibly with staleness-aware adjustments. For example, a simple update rule is $w_{\text{server}} \leftarrow w_{\text{server}} + \alpha \Delta w_k$, where $\alpha$ is a server-side learning rate or mixing weight. Alternatively, the server might buffer updates and apply them periodically.
- Continuous Operation: The process operates continuously; clients request models and submit updates independently, without synchronized rounds.
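A minimal server-side sketch of this loop, again assuming NumPy weight arrays; the class and method names are illustrative, and the fixed mixing weight $\alpha$ corresponds to the simple update rule above:

```python
import queue
import threading

class AsyncFLServer:
    """Sketch of an asynchronous FL server: no synchronized rounds."""

    def __init__(self, initial_w, alpha=0.1):
        self.w = initial_w          # current global model (NumPy array)
        self.version = 0            # incremented on every aggregation
        self.alpha = alpha          # server mixing weight
        self.inbox = queue.Queue()  # uploaded (delta, base_version) pairs
        self.lock = threading.Lock()

    def fetch_model(self):
        """Client ready / distribution: hand out the latest model."""
        with self.lock:
            return self.w.copy(), self.version

    def submit_update(self, delta, base_version):
        """Upload: clients push updates computed from possibly old models."""
        self.inbox.put((delta, base_version))

    def run(self):
        """Immediate aggregation: apply each update as it arrives."""
        while True:
            delta, base_version = self.inbox.get()
            with self.lock:
                # base_version lets the server measure staleness
                # (self.version - base_version); a staleness-aware
                # variant would shrink alpha as that gap grows
                # (see the sketch in the disadvantages discussion).
                self.w = self.w + self.alpha * delta
                self.version += 1
```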
Figure: Asynchronous FL operation. Clients fetch the latest model, compute, and upload updates independently. The server aggregates updates as they arrive, potentially using stale information.
Advantages:
- Improved Efficiency: Overcomes the straggler bottleneck. Faster clients contribute more frequently, potentially yielding faster convergence in wall-clock time.
- Higher Throughput: The system processes updates continuously, leading to better resource utilization, especially in heterogeneous environments.
- Robustness to Availability: Naturally handles clients joining and leaving or having variable response times.
Disadvantages:
- Staleness: Client updates are computed based on older versions of the global model. Aggregating these "stale" updates can hinder convergence or cause oscillations, as the updates might no longer be a descent direction for the current global model state. The degree of staleness $(t_{\text{current}} - t_{\text{client\_received}})$ is a significant factor.
- Convergence Challenges: Theoretical analysis is more complex due to staleness. Convergence might be slower in terms of the number of updates required (though faster in wall-clock time) compared to synchronous methods if staleness is not managed properly.
- Implementation Complexity: Requires careful management of model versions and potentially sophisticated aggregation functions that account for staleness, e.g., weighting updates based on their age, as sketched below.
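One commonly cited form of age-based weighting is the polynomial staleness discount from FedAsync (Xie et al., 2019); the sketch below follows that style, with the exact constants chosen for illustration:

```python
def staleness_weight(alpha, staleness, a=0.5):
    """Polynomial staleness discount in the style of FedAsync:
    alpha_eff = alpha * (1 + staleness) ** (-a).
    A fresh update (staleness 0) keeps the full weight alpha;
    older updates are progressively down-weighted.
    """
    return alpha * (1.0 + staleness) ** (-a)

def apply_stale_update(w_global, w_client, staleness, alpha=0.6):
    """Mix a (possibly stale) client model into the global model."""
    alpha_eff = staleness_weight(alpha, staleness)
    return (1.0 - alpha_eff) * w_global + alpha_eff * w_client
```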
Comparison and Trade-offs
| Feature | Synchronous FL | Asynchronous FL |
| --- | --- | --- |
| Round Structure | Discrete, synchronized rounds | Continuous operation, no fixed rounds |
| Pace | Limited by the slowest participant (straggler) | Determined by average client speed; faster progress |
| Update Staleness | Low (all updates based on the same $w_t$) | High (updates based on different, older $w_{\text{server}}$) |
| System Utilization | Can be low due to waiting | Generally higher, less idle time |
| Convergence Theory | More established, simpler analysis | More complex due to staleness |
| Implementation | Simpler server logic | More complex server logic (versioning, staleness-aware aggregation) |
| Heterogeneity | Sensitive to system heterogeneity (stragglers) | More robust to system heterogeneity |
| Fault Tolerance | Sensitive to dropouts within a round | Naturally handles client availability fluctuations |
The choice between synchronous and asynchronous FL depends heavily on the specific application context:
- For environments with relatively homogeneous clients and reliable networks, synchronous FL offers simplicity and potentially more stable convergence per communication round.
- For highly heterogeneous environments (common in cross-device FL) with clients of varying speeds and unpredictable availability, asynchronous FL can offer significant speedups in terms of wall-clock training time, provided the negative effects of staleness are adequately managed through techniques discussed later in this course (e.g., staleness-aware aggregation, adaptive learning rates).
Understanding these fundamental operational models is essential before examining more advanced algorithms designed to improve aggregation, privacy, and efficiency, as many techniques can be adapted for either synchronous or asynchronous settings, albeit with different implications.