Convergence Analysis of the Federated Learning
Published:
Introduction
This is a note to record the inference steps of the FedAvg convergence. Similar to Model concergence analysis we conduct convergence anlysis in FedAVG, and it is originated from [[Wang 等。 - 2021 - A Field Guide to Federated Optimization.pdf]] .
Model convergence Assumptions and Preliminaries
Local update SGD: xi is the local model parameter for client i. t: global round index; k: local step index x(t,k+1)i:=x(t,k)i−η∂L∂x(t,k)i=x(t,k)i−η∇x(t,k)i
Overall optimization objective: We have client population as C=1,2,3,⋯,M the objective is min
Properties of local objectives
- Each local objective F_i(x) is convex abnd L-smooth. Each client can query an unbiased stochastic gradient with σ2-uniformly bounded variance in L2 norm.
- Poperty1: Unbiased stocahstic gradient: expectation of the stochastic gradient for a given x_i^{(t,k)} is equal to the averaged local gradient for a given model \phi(\cdot) This is to say, the gradient expectation of the SGD equals to the gradient of the GD (Unbiased stochastic gradient)[[Stochastic Gradient Descent vs Gradient Descent]] \mathbb{E}[\nabla x_i^{(t,k)}|x_i^{(t,k)}]=\nabla F_i(x_i^{(t,k)})Given a dataset of $D_i = {(\mathcal{x}i,\mathcal{y}_i)}{i=0}^N$ , where the N denotes the length of the whole dataset. The objective of the GD and its gradients are calculated as:
In this case, the expectation is the weighted average of a single batch with batchsize
as bn, i.e.,
\begin{align} \mathbb{E}\left[\nabla x_i^{(t,k)}|x_i^{(t,k)}\right] &= \sum^{N-bn+1}_{j=1} \left(\sum_{i=1}^{bn}{\frac{\partial \mathcal{L}}{\partial x_{i,s_j}^{(t,k)}}}\cdot P(I=i|S=s_j)\right)\cdot P(S=s_j) \\ &= \sum^{N-bn+1}_{j=1} \left( P(I=i|S=s_j)P(S=s_j)\sum_{i=1}^{bn}{\frac{\partial \mathcal{L}}{\partial x_{i,s_j}^{(t,k)}}} \right)\\ & = \frac{1}{N}\sum_{i=1}^N{\frac{\partial \mathcal{L}}{\partial x_{i}^{(t,k)}}}\\ &= \frac{1}{N}\sum_{i=1}^{N}\nabla\mathcal{L}(\phi(\mathcal{x}_i),\mathcal{y}_i) \\ & = \nabla F_i(x_i^{(t,k)}) \end{align} \tag{1} where batch set is S = {s_1,\cdots,s_{bn}}.Note: SGD or Adam is a stochastic optimzation algorithm that select the samples from the batch randomly for gradient calculation.
- Property 2: Bounded variance: \mathbb{E}\left[\left\Vert\nabla x_i^{(t,k)}-\nabla F_i(x_i^{(t,k)})\right\Vert^2|x_i^{(t,k)}\right] \leq \sigma^2 This is to say, the gradient of the SGD is close to the gradient of the GD
Data heterogeneity/similarity across clients
Local gradient \nabla F_i(x) and global gradient \nabla F(x) is \zeta - uniformly bounded.
\max_l{\sup_x{\left\Vert \nabla F_i(x_i^{(t,k)})-\nabla F(x_i^{(t,k)})\right\Vert}} \leq \zeta
How do we measure the convergence of the FedAVG?
Generally, we want to prove that \|F(x^{(t,k+1)})-F(x^*)\|\leq\|F(x^{(t,k)})-F(x^*)\| , \forall t,k \in [1,2,3,\cdots] where F(x^*) is the optimal. Or, we give a weaker claim: \mathbb{E}\left[\frac{1}{\tau T} \sum_{t=0}^{T-1} \sum_{k=1}^\tau F\left(\overline{\boldsymbol{x}}^{(t, k)}\right)-F\left(\boldsymbol{x}^{\star}\right)\right] \leq \text{ an upper bound decreasing with } T \text {. }
Note that \mathbb{E} in this paper denotes \mathbb{E}_{i\sim \mathcal{C}} , where \mathcal{C} denotes the client set. Therefore, we can say that the \mathbb{E} is generally calculating the expectation over all the clients.
Decentralized optimization: Originated from the decentralized optimization, we derive the shadow sequence to indicate the uodate process.
\overline{x}^{(t,k)}:=\frac{1}{M}\sum^{M}_{i=1}{x_i^{(t,k)}} Then, at round t local epoch k+1, \overline{x}^{(t,k+1)}= \overline{x}^{(t,k)}-\eta \frac{1}{M}\sum^{M}_{i=1}{x_i^{(t,k)}}
Two lemmas to facilitate the convergence proof
This paper want to prove the Theorem as follows. \mathbb{E}\left[\frac{1}{\tau T} \sum_{t=0}^{T-1} \sum_{k=1}^\tau F\left(\overline{\boldsymbol{x}}^{(t, k)}\right)-F\left(\boldsymbol{x}^{\star}\right)\right] \leq \text { an upper bound decreasing with } T \text {. }
Lemma1 the paper proves that the expectation for each round is bounded. \mathbb{E}\left[\frac{1}{\tau} \sum_{k=1}^\tau F\left(\overline{\boldsymbol{x}}^{(t, k)}\right)-F\left(\boldsymbol{x}^{\star}\right)\bigg|\mathcal{F}^{(t,0)}\right] \leq Bound
Lemma2 the paper proves that the Bound in the lemma 1 is decreasing with T. Specific prove is available in Paper
Why this paper evaluate the expection instead of loss decrease in each step?
Our guess is that each-step loss may not decrease with T, but its expectation is bounded. Lets prove it in the following paragraphs.
Prove of the step-by-step decrease in loss function
We want to prove the Theorem that \|F(x^{(t,k+1)})-F(x^*)\|\leq\|F(x^{(t,k)})-F(x^*)\| , \forall t,k \in [1,2,3,\cdots] we spilt into two therom