Building upon the algorithmic foundations laid in previous chapters, we now shift focus to the structural design of federated learning systems. Understanding the architecture is fundamental to implementing, deploying, and managing FL effectively. While variations exist, most FL systems adhere to a common architectural pattern, typically involving a central coordinating server and multiple distributed clients.
Core Components
A standard federated learning system comprises several distinct components, each with specific roles and responsibilities; a minimal code sketch of these roles follows the list:
- Clients (or Workers): These are the entities holding the local, private data used for training. Clients can range from individual mobile devices or sensors (cross-device FL) to entire organizations or data silos (cross-silo FL). Their primary responsibilities include:
- Storing and managing their local dataset securely.
- Receiving the current global model parameters and training instructions from the server.
- Performing local model training on their data for one or more epochs.
- Calculating model updates (e.g., gradients or parameter differences).
- Potentially applying privacy-preserving techniques (like differential privacy noise addition or clipping) or communication optimization methods (like compression) to their updates.
- Sending the processed updates back to the server.
- Server (or Coordinator/Aggregator): This central entity orchestrates the entire federated learning process. It does not have direct access to the raw client data. Its main functions are:
- Initializing the global model parameters.
- Selecting a subset of available clients for participation in each training round (client selection).
- Broadcasting the current global model and training configuration (e.g., learning rate, number of local epochs) to the selected clients.
- Receiving model updates from participating clients.
- Potentially implementing secure aggregation protocols (using SMC or HE, as discussed in Chapter 3) to combine updates without viewing individual contributions.
- Aggregating the received updates (using algorithms like FedAvg, FedProx, SCAFFOLD, etc., from Chapter 2) to produce an improved global model.
- Updating the global model parameters based on the aggregated results.
- Evaluating the global model's performance (often using a held-out test set or by coordinating distributed evaluation).
- Repeating the process for a predefined number of communication rounds or until convergence criteria are met.
- Model: This is the machine learning model being trained collaboratively. It could be any type of model suitable for the task, such as a linear model, a support vector machine, or, commonly, a deep neural network. The model architecture is typically defined centrally and shared across all clients. Personalization techniques (Chapter 4) might involve adapting parts of this model locally.
- Communication Protocol: This defines how clients and the server interact and exchange information. It encompasses:
- Network protocols (e.g., gRPC, REST APIs over HTTPS) for transmitting data.
- Serialization formats for models and updates (e.g., Protocol Buffers, NumPy arrays).
- Mechanisms for handling client availability, dropouts, and potential network failures.
- Security measures for authentication and encrypted communication channels (discussed further in fl-system-security-considerations).
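To make these roles concrete, the sketch below outlines minimal client and server interfaces in Python. All names here (Params, TrainingConfig, local_update, select_clients) are illustrative assumptions rather than the API of any particular FL framework, and the aggregation logic is deferred to the round sketch later in this section.

```python
from dataclasses import dataclass
from typing import Dict, List

import numpy as np

# Model parameters as a name -> array mapping (an illustrative choice,
# not tied to any specific framework).
Params = Dict[str, np.ndarray]


@dataclass
class TrainingConfig:
    """Per-round instructions broadcast by the server (hypothetical fields)."""
    learning_rate: float = 0.01
    local_epochs: int = 1


class Client:
    """Holds private local data; shares only model updates, never raw examples."""

    def __init__(self, client_id: str, local_data):
        self.client_id = client_id
        self.local_data = local_data  # stays on the device / inside the silo

    def num_examples(self) -> int:
        return len(self.local_data)

    def local_update(self, global_params: Params, config: TrainingConfig) -> Params:
        """Train locally for config.local_epochs and return a parameter delta."""
        local_params = {k: v.copy() for k, v in global_params.items()}
        for _ in range(config.local_epochs):
            local_params = self._train_one_epoch(local_params, config.learning_rate)
        # Update = trained local parameters minus the received global parameters.
        delta = {k: local_params[k] - global_params[k] for k in global_params}
        return self._postprocess(delta)

    def _train_one_epoch(self, params: Params, lr: float) -> Params:
        raise NotImplementedError  # task-specific optimizer step over self.local_data

    def _postprocess(self, delta: Params) -> Params:
        # Hook for clipping, differential-privacy noise, or compression.
        return delta


class Server:
    """Coordinates training rounds; never accesses clients' raw data."""

    def __init__(self, initial_params: Params, clients: List[Client]):
        self.global_params = initial_params  # the current global model
        self.clients = clients

    def select_clients(self, fraction: float, rng: np.random.Generator) -> List[Client]:
        """Sample a subset of the available clients for the current round."""
        count = max(1, int(fraction * len(self.clients)))
        chosen = rng.choice(len(self.clients), size=count, replace=False)
        return [self.clients[i] for i in chosen]
```

Representing parameters as a dictionary of NumPy arrays keeps the sketch framework-agnostic; a real system would serialize these structures (for example, with Protocol Buffers) before transmitting them over the chosen network protocol.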
Typical Interaction Flow (Synchronous FL)
The most common interaction pattern, particularly in cross-silo settings or simulations, follows a synchronous, round-based approach:
1. Initialization: The server defines the initial global model $w^0$.
2. Client Selection: In round $t$, the server selects a subset of clients $S_t$.
3. Broadcast: The server sends the current global model $w^t$ to all clients in $S_t$.
4. Local Training: Each selected client $k \in S_t$ trains the model $w^t$ on its local data $D_k$ for $E$ epochs, resulting in a local model update $\Delta w_k^{t+1}$ (or the full local model $w_k^{t+1}$). This step often involves minimizing a local loss function $L_k(w)$.
5. Update Transmission: Each client $k$ sends its computed update $\Delta w_k^{t+1}$ (potentially after applying privacy or compression techniques) back to the server.
6. Aggregation: The server collects updates from a sufficient number of clients. It then aggregates these updates using a chosen algorithm (e.g., weighted averaging for FedAvg: $\Delta w^{t+1} = \sum_{k \in S_t'} \frac{n_k}{N_t} \Delta w_k^{t+1}$, where $S_t'$ is the set of clients that successfully returned updates, $n_k = |D_k|$, and $N_t = \sum_{k \in S_t'} n_k$).
7. Global Model Update: The server updates the global model: $w^{t+1} = w^t + \eta \Delta w^{t+1}$ (where $\eta$ is the server learning rate, often set to 1).
8. Iteration: The process repeats from Step 2 for the next round ($t+1$) until termination. A code sketch of one such round follows the figure below.
The following diagram illustrates this typical client-server architecture and interaction flow:
A typical client-server architecture for federated learning. The server coordinates rounds of training, broadcasting the model, receiving updates from clients, and aggregating them to refine the global model. Clients perform local training on their private data.
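Assuming the Client and Server sketches from earlier in this section, one synchronous round (Steps 2 through 7) could be expressed as follows. The weighted average mirrors the FedAvg formula in Step 6, and eta corresponds to the server learning rate in Step 7; run_round and its parameters are hypothetical names introduced only for illustration.

```python
import numpy as np  # the Client/Server/TrainingConfig sketch above is assumed in scope


def run_round(server: Server, config: TrainingConfig,
              fraction: float = 0.1, eta: float = 1.0, seed: int = 0) -> None:
    """Execute one synchronous FL round (Steps 2-7 of the flow above)."""
    rng = np.random.default_rng(seed)

    # Step 2: client selection.
    selected = server.select_clients(fraction, rng)

    # Steps 3-5: broadcast, local training, and update transmission.
    # In a real deployment these cross the network and clients may drop out.
    updates, weights = [], []
    for client in selected:
        try:
            delta = client.local_update(server.global_params, config)
        except ConnectionError:
            continue  # tolerate dropouts: aggregate only successful returns (S_t')
        updates.append(delta)
        weights.append(client.num_examples())  # n_k = |D_k|

    if not updates:
        return  # nothing usable arrived this round

    # Step 6: FedAvg weighted aggregation, sum over k of (n_k / N_t) * delta_k.
    total = float(sum(weights))  # N_t
    aggregated = {
        name: sum((w / total) * upd[name] for w, upd in zip(weights, updates))
        for name in server.global_params
    }

    # Step 7: global model update, w^{t+1} = w^t + eta * aggregated delta.
    server.global_params = {
        name: server.global_params[name] + eta * aggregated[name]
        for name in server.global_params
    }
```

Calling run_round repeatedly, with a fresh seed or random generator each round, implements the iteration in Step 8; termination is typically a fixed round budget or a convergence check on a held-out evaluation set.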
This client-server model provides a clear separation of concerns and simplifies orchestration. However, the central server can become a bottleneck or a single point of failure. Asynchronous protocols (discussed in Chapter 5) and alternative architectures like peer-to-peer FL exist but introduce different complexities. For most advanced implementations leveraging sophisticated aggregation or privacy techniques, this server-mediated architecture remains the most common foundation. Understanding its components and flow is essential before diving into specific framework implementations and deployment considerations later in this chapter.