Simulating federated learning scenarios, especially those involving Non-IID data, is an essential step in algorithm development and evaluation. It allows us to understand how different strategies perform under controlled, yet realistic, conditions before attempting complex real-world deployments. This hands-on exercise will guide you through simulating a Non-IID data distribution and comparing the performance of standard Federated Averaging (FedAvg) against a mitigation technique, specifically FedProx, which we covered in Chapter 2.

## Setting Up the Simulation

Most federated learning simulations involve three core components: a central server, a set of clients, and a dataset partitioned across these clients. You can use frameworks like TensorFlow Federated (TFF), PySyft, or Flower (as detailed in Chapter 6) to facilitate this setup. For this exercise, we focus on the conceptual steps, which apply across frameworks.

We typically start with a standard classification benchmark, such as MNIST or CIFAR-10. The critical step is how we distribute this data among the simulated clients to create statistical heterogeneity.

## Simulating Non-IID Data Distributions

Creating realistic Non-IID data distributions is critical for meaningful simulations. A common and effective method is to use a Dirichlet distribution to allocate class labels among clients.

Imagine we have $N$ clients and $C$ data classes. We can model the distribution of classes on client $k$ using a probability vector $p_k = (p_{k,1}, \ldots, p_{k,C})$, where $p_{k,j}$ is the proportion of samples belonging to class $j$ on client $k$. To generate heterogeneous distributions, we draw each $p_k$ from a symmetric Dirichlet distribution with concentration parameter $\alpha$:

$$ p_k \sim \text{Dir}(\alpha) $$

- A small value of $\alpha$ (e.g., $\alpha = 0.1$) results in high heterogeneity: each client is likely to have data dominated by only one or a few classes.
- A large value of $\alpha$ (e.g., $\alpha = 100$) leads to distributions closer to IID, where each client's class distribution resembles the global distribution.

Implementation steps:

1. Choose a dataset: Select a benchmark dataset (e.g., CIFAR-10 with 10 classes).
2. Set parameters: Define the number of clients ($N$) and the Dirichlet parameter ($\alpha$).
3. Generate proportions: For each client $k = 1, \ldots, N$, sample a probability vector $p_k \sim \text{Dir}(\alpha)$.
4. Assign data: Partition the dataset samples. For each class $j$, distribute its samples among the clients according to the proportions $\{p_{1,j}, p_{2,j}, \ldots, p_{N,j}\}$ (normalized so they sum to one), ensuring each client receives a reasonable minimum number of samples if needed. A code sketch of this procedure appears after this list.

Other ways to introduce heterogeneity include varying the quantity of data per client (quantity skew) or partitioning based on underlying data features, but label distribution skew using a Dirichlet distribution is a widely adopted standard for benchmarking.
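The sketch below shows one way to implement this partitioning with NumPy. It follows the per-class allocation described in the Assign Data step: for each class, a proportion vector over the clients is drawn from $\text{Dir}(\alpha)$ and used to split that class's samples. The function name `dirichlet_partition` and the `min_samples` retry are illustrative choices, not part of any particular framework's API.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, min_samples=10, seed=0):
    """Partition sample indices across clients with label-distribution skew.

    labels: 1-D array of integer class labels for the full training set.
    Returns a list of index arrays, one per client.
    """
    rng = np.random.default_rng(seed)
    num_classes = int(labels.max()) + 1
    client_indices = [[] for _ in range(num_clients)]

    for c in range(num_classes):
        class_idx = np.where(labels == c)[0]
        rng.shuffle(class_idx)
        # Proportions of class c assigned to each client, drawn from Dir(alpha).
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Convert proportions into split points over this class's samples.
        split_points = (np.cumsum(proportions) * len(class_idx)).astype(int)[:-1]
        for client_id, idx in enumerate(np.split(class_idx, split_points)):
            client_indices[client_id].extend(idx.tolist())

    # Simple retry with a new seed if any client ended up with too few samples.
    if min(len(idx) for idx in client_indices) < min_samples:
        return dirichlet_partition(labels, num_clients, alpha, min_samples, seed + 1)

    return [np.array(idx) for idx in client_indices]
```

The resulting index lists can be wrapped into per-client datasets using your framework's utilities (e.g., one `torch.utils.data.Subset` per client) before starting the simulation.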
## Baseline: FedAvg Performance on Non-IID Data

First, run a standard FedAvg simulation using the Non-IID data partition created above. Train a suitable model (e.g., a simple CNN for CIFAR-10) over several communication rounds. Each FedAvg round consists of the following steps:

1. The server selects a subset of clients.
2. The server sends the current global model $w^t$ to the selected clients.
3. Each selected client $k$ trains the model on its local data $D_k$ for $E$ epochs, obtaining a local model $w_{k}^{t+1}$.
4. The clients send their updated models $w_{k}^{t+1}$ back to the server.
5. The server aggregates the models:
   $$ w^{t+1} = \sum_{k \in S_t} \frac{n_k}{n} w_{k}^{t+1} $$
   where $S_t$ is the set of selected clients, $n_k = |D_k|$, and $n = \sum_{k \in S_t} n_k$.

Monitor the global model's accuracy on a held-out, representative test set over the communication rounds. You will likely observe slower convergence and potentially lower final accuracy than in simulations run on IID data, especially with small $\alpha$ values. This performance degradation highlights the challenge posed by Non-IID data.

## Mitigation: Implementing FedProx

Now, let's implement FedProx to address the client drift caused by heterogeneity. Recall from Chapter 2 that FedProx modifies the local client objective by adding a proximal term. Instead of minimizing only the local empirical loss $F_k(w) = \frac{1}{n_k} \sum_{i \in D_k} \ell(w; x_i, y_i)$, the client optimizes:

$$ \min_w H_k(w) = F_k(w) + \frac{\mu}{2} \|w - w^t\|^2 $$

Here, $w^t$ is the global model received from the server at the start of round $t$, and $\mu \ge 0$ is a hyperparameter controlling the strength of the proximal term. This term limits how far the local model $w$ can stray from the initial global model $w^t$ during local training.

Implementation changes:

1. Modify the client update: When performing local training (e.g., with SGD), the gradient update must incorporate the gradient of the proximal term. If the local optimizer update for minimizing $F_k(w)$ is $w \leftarrow w - \eta \nabla F_k(w)$, the update for $H_k(w)$ becomes:
   $$ w \leftarrow w - \eta \left( \nabla F_k(w) + \mu (w - w^t) \right) $$
   This can often be implemented by modifying the loss function or directly adjusting the gradients during the local training loop in your chosen framework (see the sketch below).
2. Tune the hyperparameter $\mu$: The value of $\mu$ is important. $\mu = 0$ recovers FedAvg. A larger $\mu$ enforces more similarity to the global model, potentially counteracting drift but possibly hindering personalization if local data truly requires a different model. Typical values range from 0.01 to 1.0 and usually require some tuning.

Run the simulation again using the same Non-IID data partition but with the FedProx client update rule, using a non-zero value of $\mu$ (e.g., $\mu = 0.1$ or $\mu = 1.0$ as starting points).
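Below is a minimal sketch of such a client update, assuming a PyTorch classification model and a plain SGD loop. The proximal term is added directly to the loss, so autograd contributes the extra $\mu (w - w^t)$ term to the gradients. The function name `fedprox_local_training` is illustrative, not a framework API.

```python
import torch

def fedprox_local_training(model, global_model, data_loader, mu, lr, epochs, device="cpu"):
    """Local FedProx training: minimize F_k(w) + (mu / 2) * ||w - w^t||^2."""
    model.to(device)
    model.train()
    # Freeze a copy of the global parameters w^t received at the start of the round.
    global_params = [p.detach().clone().to(device) for p in global_model.parameters()]

    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()

    for _ in range(epochs):
        for x, y in data_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            # Proximal term: (mu / 2) * ||w - w^t||^2, summed over all parameter tensors.
            prox = sum((p - g).pow(2).sum()
                       for p, g in zip(model.parameters(), global_params))
            loss = loss + 0.5 * mu * prox
            loss.backward()
            optimizer.step()

    return model.state_dict()
```

Setting `mu=0.0` reduces this to the standard FedAvg local update, which makes it convenient to run both baselines from the same code path.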
## Comparing FedAvg and FedProx on Non-IID Data

After running both simulations (FedAvg and FedProx) on the same Non-IID data setup, compare their performance. Plot the global model test accuracy against the communication round number for both algorithms.

*Figure: FedAvg vs. FedProx on simulated Non-IID data. Comparison of global model test accuracy (%) over communication rounds on a synthetically generated Non-IID dataset (e.g., CIFAR-10 with Dirichlet partitioning, $\alpha = 0.3$). FedProx often shows improved stability and convergence compared to FedAvg under heterogeneity.*

## Discussion and Further Steps

The simulation results, visualized in the chart above, typically show that FedProx can achieve more stable convergence and potentially higher final accuracy than FedAvg under significant statistical heterogeneity. The proximal term effectively restricts clients from diverging too far during local training, mitigating the negative impact of Non-IID data on global model aggregation.

This exercise provides a foundation for experimenting with heterogeneity:

- Vary heterogeneity: Change the Dirichlet parameter $\alpha$ to observe its impact on the performance gap between FedAvg and FedProx.
- Tune FedProx: Experiment with different values of the $\mu$ hyperparameter. Observe the trade-off between convergence speed, final accuracy, and stability.
- Implement other methods: Implement and compare other algorithms discussed in this chapter or Chapter 2, such as SCAFFOLD or clustered federated learning approaches, on the same Non-IID setup.
- Combine heterogeneity: Simulate combinations of statistical and systems heterogeneity (e.g., varying the number of local epochs $E$ per client alongside Non-IID data).

By systematically simulating these scenarios, you gain valuable insight into the strengths and weaknesses of different federated learning strategies, paving the way for designing more effective and reliable systems for real-world applications.
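As a concrete starting point for the experiments above, the sketch below loops over a small grid of Dirichlet $\alpha$ and FedProx $\mu$ values. It reuses `dirichlet_partition` from earlier; `train_labels` and `run_federated_training` are hypothetical placeholders for your training labels and for a wrapper around whichever simulation framework you chose.

```python
import itertools

# Hypothetical wrapper: runs one full federated simulation and returns the
# per-round global test accuracy. Implement it with TFF, Flower, or your own loop.
# def run_federated_training(client_indices, algorithm, mu, rounds): ...

alphas = [0.1, 0.3, 1.0, 100.0]   # heterogeneity levels (smaller = more skew)
mus = [0.0, 0.01, 0.1, 1.0]       # mu = 0.0 recovers plain FedAvg

results = {}
for alpha, mu in itertools.product(alphas, mus):
    client_indices = dirichlet_partition(train_labels, num_clients=100, alpha=alpha)
    accuracy_curve = run_federated_training(
        client_indices, algorithm="fedprox", mu=mu, rounds=50)
    results[(alpha, mu)] = accuracy_curve
    print(f"alpha={alpha}, mu={mu}: final accuracy {accuracy_curve[-1]:.1f}%")
```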