Published in: Complex & Intelligent Systems 2/2023

Open Access 04.11.2022 | Original Article

Semi-HFL: semi-supervised federated learning for heterogeneous devices

Authors: Zhengyi Zhong, Ji Wang, Weidong Bao, Jingxuan Zhou, Xiaomin Zhu, Xiongtao Zhang



Abstract

In the vanilla federated learning (FL) framework, the central server distributes a globally unified model to each client and uses labeled samples for training. However, in most cases, clients are equipped with different devices and are exposed to a variety of situations. There are great differences between clients in storage, computing, communication, and other resources, which means that the unified deep models used in traditional FL cannot fit clients’ personalized resource conditions. Furthermore, traditional FL requires a great deal of labeled data, whereas data labeling demands a substantial investment of time and resources, which is hard for individual clients to afford. As a result, clients typically hold only vast amounts of unlabeled data, which conflicts with the requirements of federated learning. To address these two issues, we propose Semi-HFL, a semi-supervised federated learning approach for heterogeneous devices, which divides a deep model into a series of small submodels by inserting early exit branches to meet the resource constraints of different devices. Furthermore, considering the limited availability of labeled data, Semi-HFL introduces semi-supervised techniques into the above heterogeneous learning process. Specifically, the semi-supervised learning process comprises two training phases, unsupervised learning on clients and supervised learning on the server, which makes full use of clients’ unlabeled data. Through image classification, text classification, next-word prediction, and multi-task FL experiments on five kinds of datasets, we verify that, compared with the traditional homogeneous learning method, Semi-HFL not only achieves higher accuracies but also significantly reduces the global resource overhead.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Introduction

With the development of technology, the computing capabilities of end devices such as mobile phones have improved greatly, and an increasing number of end devices are now capable of performing complicated computing tasks. In a self-driving scenario, for example, the vehicle must perceive its surrounding environment, distinguish objects, and plan paths in real time. A distributed computing framework is formed when massive numbers of end devices engage in the computation. For model training, traditional distributed computing solutions require users to upload their data directly to the server. However, this process consumes a significant amount of communication resources. More critically, the majority of these data contain personal information about users, posing a serious threat to their privacy. Federated learning (FL) has become a popular distributed training framework in recent years because it allows the model to be trained locally and effectively eliminates data leakage. Under the FL framework, a central server aggregates multiple clients’ model parameters and finally obtains a globally unified large model, realizing cross-device and cross-region collaborative training while fully protecting users’ privacy.
In fact, under the distributed computing framework, the composition of end devices is extremely complex, not only in quantity but also in variety, including mobile phones, smart wearable devices, cameras, and so on. Because different clients are exposed to different situations and tasks, their computation, communication, and storage capabilities vary greatly. Even when the same type of task is performed, there is strong heterogeneity in environments and other factors, which is called system heterogeneity [1]. As a result, in the federated learning process, the upper limit of model complexity at which each client can participate in learning is not the same. To maximize the use of clients’ data for training, model complexity must be adapted to different clients [2, 3]. However, in a traditional federated learning framework, a globally unified model is distributed to each client, and models with the same structure are then aggregated without taking the problem of system heterogeneity into account, resulting in some stragglers’ data features failing to be integrated into the global model. In addition, traditional federated learning also necessitates labeled data on the client. Unfortunately, in most cases, building a sufficiently large labeled database is extremely challenging: data labeling not only takes a long time but also requires a large number of specialists to do low-skilled, highly repetitive work, wasting human resources. In real-world scenarios, there is a significant amount of unlabeled data on clients. To lessen the demand for client-side labeled data, it is critical to relax the supervised learning premise of the federated learning framework by integrating semi-supervised learning approaches into FL.
Currently, as far as we know, studies about heterogeneous federated learning are mostly based on fully labeled data on clients [2, 4]. There is still a wide gap between heterogeneous federated learning and semi-supervised learning. To solve the above problems simultaneously, we propose a semi-supervised federated learning method for heterogeneous devices (Semi-HFL), which allows heterogeneous clients to learn together with only a limited amount of labeled data on the server. First, for clients with different resources, based on the idea of multi-branch fast inference [5], a model can be divided into several small models that can independently complete training and inference tasks by inserting an early exit branch in the middle of the model (shown in Fig. 1), forming a series of submodels of various complexities that satisfy various client resource requirements. In the process of federated learning, models of suitable size are distributed to matching clients. After multiple iterations of local updates, they are sent to the server for aggregation; thus, a global model with multiple branches is formed. In the inference stage, the global model can achieve fast inference through middle branches. Meanwhile, to make full use of the massive unlabeled data from clients, we use a tiny amount of labeled data on the server to pretrain the multi-branch model. After all submodels are distributed to the clients, pseudo-labels are generated for local training; thus, semi-supervised learning is realized. In this paper, we will verify the effectiveness of Semi-HFL in different data distributions. To reduce the impact of data skew on the accuracy of models, we introduce a regularization term [6] in the loss function to balance the parameters of local and global models.
The main contributions are as follows:
  • A novel heterogeneous federated learning method called Semi-HFL is proposed, which introduces a multi-branch model to solve the system heterogeneity problem in FL. There are two main innovations. First, a splitting method of multi-branching models is designed to split the global model into submodels of different complexities. Second, we give a novel aggregation method for aggregating the split submodels into a global one in the aggregation step of FL.
  • The semi-supervised learning method under the above FL framework is innovated. For heterogeneous federated settings, a “multi-teacher to multi-student” semi-supervised learning mode is formed using a modest quantity of labeled data on the server. It breaks the limitation that the traditional semi-supervised FL method applies only to the single-exit model.
  • The convergence of Semi-HFL is analyzed, and its feasibility is verified through image and text classification experiments on different data distributions. The effectiveness of Semi-HFL is proved theoretically and practically.
In the following content, related work is introduced in Sect. “Related works”. Sections “The proposed method: Semi-HFL” and “Algorithm” illustrate Semi-HFL from the mathematical and algorithmic perspectives, respectively. The convergence analysis is presented in Sect. “Convergence analysis”, and experiments are conducted in Sect. “Experimental verification”. Finally, Sect. “Conclusions” concludes the paper.

Heterogeneous federated learning

Federated learning can connect a great number of clients to realize collaborative training and has been applied to many fields, such as neural architecture search (NAS) [7], industrial cyber-physical systems [8], recommender systems [9], etc. However, due to the different environments and equipment each client faces, there is significant heterogeneity among clients. This heterogeneity can be roughly divided into data heterogeneity caused by unbalanced data distributions, system heterogeneity caused by different client resource conditions, and model heterogeneity caused by various tasks [1]. To address data heterogeneity, Wang et al. [10] proposed a monitoring scheme that can infer the composition of training data and designed a new loss function called Ratio Loss to reduce the impact of imbalance. Based on the homomorphic encryption technique, [11] chose to mitigate data heterogeneity through user selection. Besides, some researchers [12] optimized models by learning the global feature representation shared between non-IID data. As for system heterogeneity, [2] constructed a series of models of different complexities to adapt to various devices by reducing the width of the hidden layers, and [4] proposed a federated learning protocol that manages clients based on their resource conditions. In addition, techniques like asynchronous federated learning [13, 14] have been studied. When clients face different application scenarios or perform different tasks, the problem of model heterogeneity arises. Hence, [15] put forward Moreau envelopes to perform personalized federated learning by introducing a regularization loss function, which helps separate personalized model optimization from the global model.
Broadly speaking, personalized federated learning methods for model heterogeneity can be divided into the following categories: Adding User Context [16], Mixture of Global and Local Models [17], Multi-task Learning [18], Meta-Learning [19], Knowledge Distillation [20], Base+Personalization Layers [21], Transfer Learning [22], and so on.
We mainly focus on the system heterogeneity problem under FL. The method proposed in this paper will transform a large model into small models of different complexities to meet the requirements of different clients by inserting branches in the middle layer.

Fast inference

The multi-branch fast-inference model has received great attention from a large number of researchers and has been applied to a variety of task scenarios, such as image classification [23–25], text ranking [26, 27], text classification [28], machine translation [29], key point detection [30], etc. In the multi-branch model, three key issues are mainly involved [31]: the design of the model, the training process, and the inference process. First of all, when designing a model, the number and positions of branches are two key factors. For example, some researchers chose to insert branches after specific intermediate layers [5, 32], and some preferred to insert one after each layer [33], though this brings additional overhead [31]. Second, the training of multi-branch models can be roughly divided into joint trunk-and-branch training [5, 32] and separate training; the latter is usually more scalable. From the perspective of training methods, knowledge distillation is a popular branch training method [28, 34, 35], under which the middle branches of the model are regarded as students and the subsequent branch or the last exit as the teacher. Finally, in the inference process, an important issue is how to set the criteria by which samples exit the model. Currently, there are two ways: one is to preset the rules artificially [33, 36, 37], such as a loss threshold [5], and the other is to learn the rules [38–40].
In this paper, we perform experiments with multi-branch versions of LeNet and ResNet, building on the work of Teerapittayanon et al. [5]. During the training phase, we jointly train the trunk and branches. In the inference process, the number of samples exiting at each branch is determined by the ratio of that branch’s training accuracy to the sum over all branches.

Semi-supervised learning

A good deep learning model usually necessitates massive amounts of labeled data for training. In practice, however, labeling data takes a lot of time and effort, whereas unlabeled data are cheap and easy to obtain, which led to the development of semi-supervised learning [41]. Since its advent in the 1970s [42–44], semi-supervised learning has received widespread attention [45–47] because of its great advantage of leveraging unlabeled data. Currently, semi-supervised learning methods are mainly divided into generative models [48, 49], consistency regularization [50, 51], graph-based methods [52, 53], pseudo-labeling methods [54, 55], and hybrid methods [56, 57]. In particular, pseudo-labeling, the main focus of this paper, is a very common semi-supervised method that uses unlabeled data with high-confidence predictions as training data. It can be combined with knowledge distillation [58], data augmentation [59], and other techniques to achieve extremely competitive results.
Although semi-supervised learning has been a research hotspot over the past decades [41], as far as we know, few researchers have studied how to implement semi-supervised learning under the framework of federated learning. Therefore, we study how to realize semi-supervised learning under federated learning.

The proposed method: Semi-HFL

The main innovations of the proposed method are the heterogeneous federated learning framework and the semi-supervised learning method designed for it. Next, we introduce Semi-HFL in terms of these two aspects.

Heterogeneous federated learning

Under the framework of federated learning, the computing, communication, and storage resources of clients differ. For example, a mobile phone, as a common smart device, usually has 4–8 GB of RAM and 64–256 GB of storage, while a portable computer can reach 64 GB and 2 TB, respectively. Different resource conditions lead to different computing, communication, and storage capabilities. Therefore, how to maximize the use of various client resources, avoid clients being abandoned because they fail to meet the model training requirements, and improve resource utilization efficiency has become an urgent problem in heterogeneous federated learning.
Considering that increasing the model depth brings more computing, communication, and storage overhead, while a shallow model consumes fewer resources, we insert exit branches into the neural network to convert a single-exit global deep model into a multi-exit model. Based on each exit branch, we split the global deep model into several shallow models that can independently complete training and inference tasks, adapting to the resource requirements of different clients (as shown in Fig. 2). We assume that the number of layers in each branch is small enough that the computing overhead at each branch is less than that of further computation in the deep network.
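As a concrete illustration of this splitting, the toy sketch below represents the backbone as an ordered list of segments and each submodel as a prefix of segments plus one exit branch; the dictionary-based representation and all names are our own illustration, not the paper's implementation.

```python
# Schematic view of early-exit splitting: the backbone is cut into n
# segments by n exit branches; submodel i consists of the first i
# backbone segments plus branch i, so submodel complexity grows with i.

def split_into_submodels(backbone_segments, branches):
    """Return the n single-exit submodels induced by the exit branches."""
    assert len(backbone_segments) == len(branches)
    submodels = []
    for i in range(1, len(branches) + 1):
        submodels.append({
            "backbone": backbone_segments[:i],  # theta^{i(1)}..theta^{i(i)}
            "branch": branches[i - 1],          # lambda^i
        })
    return submodels

segments = ["theta(1)", "theta(2)", "theta(3)"]
branches = ["lambda1", "lambda2", "lambda3"]
subs = split_into_submodels(segments, branches)
# submodel 2 contains the first two backbone segments and branch 2
```

A shallow cluster would then receive `subs[0]` while the most capable cluster receives `subs[-1]`, which contains the entire backbone.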

The training process

Taking the classification task as an example, for the training process of the above-mentioned multi-branch model, we suppose that the global model \(\omega (t)\) has n branches, corresponding to n submodels, where t denotes the tth round of FL. The ith branch together with all the backbone layers before it constitutes the submodel \({\omega ^i}(t)\), forming a sequence of submodels \(\left\{ {{\omega ^1}\left( t \right) ,{\omega ^2}\left( t \right) , \ldots ,{\omega ^n}\left( t \right) } \right\} \). In this paper, we assume that the complexity of \({\omega ^i}(t)\) is greater than that of \({\omega ^{(i-1)}}(t)\). In addition, we use \({\theta ^i}(t)\) to represent the backbone parameters of the ith submodel \({\omega ^i}(t)\) and \({\lambda ^i}(t)\) to represent the parameters of its branch. The n branches split the model backbone into n parts. \({\theta ^{i(k)}}(t)\) denotes the kth part of this split backbone within the ith submodel, which corresponds to the backbone between the \(k-1\)th and kth branches in the global model (as shown in Fig. 2). Note that in \({\theta ^{i(k)}}(t)\), \(k \le i\) always holds. For example, \({\theta ^{i(2)}}(t)\) is a part of \({\omega ^i}(t)\) and corresponds to the backbone network between the first and second branches of \(\omega (t)\). All clients holding the same submodel constitute a client cluster. If the total number of clients is L, the numbers of clients in the client clusters are denoted by \({l_1},{l_2}, \ldots ,{l_n}\), respectively, where \(L = {l_1} + {l_2} + \cdots + {l_n}\). Then, the local training process of federated learning can be expressed as
$$\begin{aligned} \omega _j^i\left( {t + 1} \right) = \omega _j^i\left( t \right) - \eta \nabla {\tilde{F}}_j^i(\omega _j^i\left( t \right) ),j = 1,2, \ldots , {l_i}, \end{aligned}$$
(1)
where \({\tilde{F}}_j^i(\omega _j^i\left( t \right) )\) is the regularized local loss function and \(\omega _j^i\left( t \right) \) is the model of the jth client in the ith client cluster. Considering that the data distribution across clients is not always independent and identically distributed, i.e., data heterogeneity, to enable the locally trained model to integrate the features of local data while avoiding overfitting, following the work of Li et al. [6], we introduce a regularization term into the local training loss to balance the local and global models, which can be expressed as
$$\begin{aligned} {\tilde{F}}_j^i(\omega _j^i(t)) = F_j^i(\omega _j^i(t)) + \frac{\mu }{2}{\left\| {\omega _j^i(t) - {\omega ^i}(t)} \right\| ^2}, \end{aligned}$$
(2)
where \(F_j^i(\omega _j^i(t))\) is the local loss function. The local optimization task can be expressed as
$$\begin{aligned} \mathop {\min }\limits _{\omega _j^i(t)} {\tilde{F}}_j^i(\omega _j^i(t)) = F_j^i(\omega _j^i(t)) + \frac{\mu }{2}{\left\| {\omega _j^i(t) - {\omega ^i}(t)} \right\| ^2}. \end{aligned}$$
(3)
Hence, the global optimization goal is as follows:
$$\begin{aligned} \mathop {\min }\limits _\omega {\tilde{F}}(\omega (t))= & {} \sum \limits _{i = 1}^n {\sum \limits _{j = 1}^{{l_i}} {F_j^i(\omega _j^i(t))} } \nonumber \\&+ \frac{\mu }{2}\sum \limits _{i = 1}^n {\sum \limits _{j = 1}^{{l_i}} {{{\left\| {\omega _j^i(t) - {\omega ^i}(t)} \right\| }^2}}}. \end{aligned}$$
(4)
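A minimal numerical sketch of the proximal local update in Eqs. (1)–(3), using a toy quadratic local loss; the loss function, learning rate, and dimensions below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# One SGD step on the regularized loss of Eq. (2): the gradient is the
# local-loss gradient plus mu * (w_local - w_global), pulling the local
# model back toward the distributed global submodel.

def proximal_sgd_step(w_local, w_global, grad_F, eta=0.1, mu=0.3):
    grad = grad_F(w_local) + mu * (w_local - w_global)  # grad of F-tilde
    return w_local - eta * grad

# toy local loss F_j(w) = 0.5 * ||w - target||^2, gradient w - target
target = np.array([1.0, -1.0])
grad_F = lambda w: w - target

w_global = np.zeros(2)
w = w_global.copy()
for _ in range(200):
    w = proximal_sgd_step(w, w_global, grad_F)
# the proximal term biases the solution toward w_global, so the fixed
# point lies between target and w_global: w* = target / (1 + mu)
```

With `mu = 0`, the iteration would converge to `target` exactly; the regularization shrinks the local drift, which is the intended balancing effect under data heterogeneity.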
In our work, consistent with Li et al. [6], we set \(\mu = 0.3\). After performing a predefined number of iterations, each client uploads its model to the server for aggregation. The aggregation process can be divided into two stages: homogeneous aggregation and heterogeneous aggregation. Homogeneous aggregation averages the models within the same client cluster, as follows:
$$\begin{aligned} {\omega ^i}\left( {t + 1} \right) = \frac{1}{{{D^i}}}\sum \limits _{j = 1}^{{l_i}} {D_j^i\omega _j^i} \left( {t + 1} \right) , \end{aligned}$$
(5)
where \(D_j^i\) is the data size of the jth client in the ith cluster and \({D^i}\) is the total data size of all clients in the ith cluster. Based on the result of homogeneous aggregation, heterogeneous aggregation combines the models of different client clusters, whose architectures, and hence complexities, differ. It consists of two parts: backbone aggregation and branch aggregation. Branch aggregation can be expressed as
$$\begin{aligned} {\lambda ^i}\left( {t + 1} \right) = \frac{1}{{{D^i}}}\sum \limits _{j = 1}^{{l_i}} {D_j^i\lambda _j^i(t + 1)}. \end{aligned}$$
(6)
Backbone aggregation is as follows:
$$\begin{aligned} {\theta ^{*(i)}}(t + 1) = \frac{1}{{\sum \nolimits _{k = i}^n {{D^k}} }}\sum \limits _{k = i}^n {{D^k}{\theta ^{k(i)}}(t+1)}, \end{aligned}$$
(7)
where \({\theta ^{*(i)}}(t + 1)\) denotes the backbone network parameters between the \(i-1\)th and ith branches in the global model. Therefore, the entire set of backbone parameters is the union of all \({\theta ^{*(i)}}(t + 1)\)
$$\begin{aligned} \theta (t + 1)&= {\theta ^{*(1)}}(t + 1) \cup {\theta ^{*(2)}}(t + 1)\nonumber \\&\quad \cup \cdots \cup {\theta ^{*(n)}}(t + 1). \end{aligned}$$
(8)
Hence, global model parameters are the union of backbone parameters and all the branch parameters
$$\begin{aligned} \omega (t + 1)&= \theta (t + 1) \cup {\lambda ^1}\left( {t + 1} \right) \cup {\lambda ^2}\left( {t + 1} \right) \nonumber \\&\quad \cup \cdots \cup {\lambda ^n}\left( {t + 1} \right) . \end{aligned}$$
(9)
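The two-stage aggregation of Eqs. (5)–(9) can be sketched with scalar stand-ins for the parameter tensors; the cluster sizes and parameter values below are toy data, and the list layout is our own illustration.

```python
import numpy as np

# Cluster i holds clients whose submodel contains backbone parts 1..i.
# backbone_parts[i][k] = theta^{i(k)} after homogeneous aggregation
# (Eq. 5); cluster_sizes[i] = D^i.

def weighted_avg(params, sizes):
    sizes = np.asarray(sizes, dtype=float)
    return np.dot(sizes, params) / sizes.sum()

cluster_sizes = [10.0, 20.0, 30.0]   # D^1, D^2, D^3
backbone_parts = [
    [1.0],                # cluster 1 owns part 1 only
    [2.0, 5.0],           # cluster 2 owns parts 1, 2
    [3.0, 7.0, 9.0],      # cluster 3 owns parts 1, 2, 3
]

# Eq. (7): part k is averaged over clusters k..n (the only clusters deep
# enough to contain it), weighted by cluster data size; the union of the
# parts (Eq. 8) then forms the global backbone.
n = len(cluster_sizes)
global_backbone = []
for k in range(n):
    vals = [backbone_parts[i][k] for i in range(k, n)]
    global_backbone.append(weighted_avg(vals, cluster_sizes[k:]))
```

Note that deeper parts are averaged over fewer clusters: part n comes from cluster n alone, matching the fact that only the deepest submodel contains it.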

The inference process

In the inference stage of the multi-branch model, we assume that the higher a branch’s test accuracy, the more likely it is to produce correct inference results; thus, the number of exit samples (samples credible enough to exit the network early through that branch) should be larger. Therefore, we adopt a proportional exit method to define the number of exit samples at each branch. That is, after training on the server, each branch’s accuracy in the aggregated large model is calculated, and the ratio of a branch’s accuracy to the sum over all branches is its sample exit ratio
$$\begin{aligned} {p_i} = acc{_i}/\sum \limits _{j = 1}^n {acc{_j}}, \end{aligned}$$
(10)
where \({p_i}\) is the proportion of exit samples at branch i and \(\mathrm{acc}_i\) is the test accuracy of the submodel formed by branch i. Following the work of Teerapittayanon et al. [5], we define an entropy function as the criterion for deciding whether a sample exits at a branch: the smaller the entropy value, the more credible the prediction and the more likely the sample is to exit early, and vice versa. Once the exit ratio of each branch is determined, all samples are sorted by entropy value, and only samples with sufficiently small entropy values exit early through the branch. The entropy is defined as follows:
$$\begin{aligned} \mathrm{entropy}({\hat{y}}) = - \sum \limits _{c \in C} {{{{\hat{y}}}_c}\log {{{\hat{y}}}_c}}, \end{aligned}$$
(11)
where C is the set of classes and \({\hat{y}}_c\) is the predicted probability of class c.
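A small sketch of the proportional exit rule of Eqs. (10) and (11); the branch accuracies and softmax outputs below are toy values we made up for illustration.

```python
import numpy as np

# Eq. (11): Shannon entropy of a softmax output; low entropy means a
# confident, credible prediction that may exit early.
def entropy(probs):
    probs = np.clip(probs, 1e-12, 1.0)
    return -np.sum(probs * np.log(probs), axis=-1)

branch_accs = np.array([0.6, 0.8, 0.9])
exit_ratios = branch_accs / branch_accs.sum()    # Eq. (10): p_i

# toy softmax outputs of the first branch for 4 samples
probs = np.array([
    [0.98, 0.01, 0.01],   # confident -> low entropy
    [0.34, 0.33, 0.33],   # uncertain -> high entropy
    [0.90, 0.05, 0.05],
    [0.50, 0.30, 0.20],
])
ent = entropy(probs)
n_exit = int(round(exit_ratios[0] * len(probs)))  # quota for this branch
exit_idx = np.argsort(ent)[:n_exit]               # lowest-entropy first
```

Samples not selected here continue down the backbone to the next branch, where the same ranking is applied against that branch's quota.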

Semi-supervised learning

In real scenarios, due to factors like labor costs, massive amounts of unlabeled data reside on the clients, while only a limited amount of labeled data is available on the server. Hence, we further extend the above heterogeneous learning method to semi-supervised scenarios. Based on the basic idea of knowledge distillation, we design a multi-teacher to multi-student semi-supervised training method for the heterogeneous federated learning framework. Specifically, the whole semi-supervised FL process contains four steps: supervised learning on the server, unsupervised learning on clients, federated aggregation, and model fine-tuning.

Supervised learning on the server

In the supervised learning stage, the server pretrains the global model with the labeled data \({D_L}\) and obtains the test accuracies of all branch submodels. After pretraining, the model is decomposed into a series of submodels \(\mathrm{Teachers} = \left\{ {{\omega ^1},{\omega ^2}, \ldots ,{\omega ^n}} \right\} \) that are distributed to the matching clients as teacher models.

Unsupervised learning on clients

The model obtained by the jth client in the ith cluster is \(\omega _j^i\), and its test accuracy is \(\mathrm{acc}_j^i\). We regard \(\omega _j^i\) as the teacher used to predict labels for the client’s local unlabeled data, producing pseudo-labeled data. We assume that the higher the initial accuracy of the model, the more reliable the predicted labels are, and the more pseudo-labeled data should participate in training the local model. Therefore, we take \(\mathrm{acc}_j^i\) as the proportion used to select local training data from the pseudo-labeled data. That is, if the amount of unlabeled data on the client is \(D_j^i\), then \(D_j^i \times \mathrm{acc}_j^i\) samples are selected from the pseudo-labeled data for training. Among all predicted labels, we use the entropy function defined in Eq. (11) as the basis for selecting samples: if the entropy value of a pseudo-label is small enough, it is taken as training data for local learning. Then, we obtain a series of student models of a specific submodel trained by different clients, which can be expressed as \(\mathrm{Students}^{i} = \left\{ {\omega _1^i,\omega _2^i, \ldots ,\omega _{{l_i}}^i} \right\} \). All client student models can be expressed as \(\mathrm{Students} = \left\{ {\mathrm{Students}^{1},\mathrm{Students}^{2}, \ldots ,\mathrm{Students}^{n}} \right\} = \left\{ {\left\{ {\omega _1^1,\omega _2^1, \ldots ,\omega _{{l_1}}^1} \right\} , \ldots ,\left\{ {\omega _1^n,\omega _2^n, \ldots ,\omega _{{l_n}}^n} \right\} } \right\} \).
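The client-side selection step can be sketched as follows; the teacher's softmax outputs are toy values, and the helper name is our own, not the paper's code.

```python
import numpy as np

# Keep the D_j^i * acc_j^i lowest-entropy teacher predictions as
# pseudo-labeled local training data: a more accurate teacher is
# trusted to label a larger fraction of the client's unlabeled data.

def select_pseudo_labeled(probs, acc):
    probs = np.clip(probs, 1e-12, 1.0)
    ent = -np.sum(probs * np.log(probs), axis=1)  # entropy per sample
    n_keep = int(len(probs) * acc)                # acc-proportional quota
    keep = np.argsort(ent)[:n_keep]               # most confident samples
    pseudo_labels = probs[keep].argmax(axis=1)    # teacher's hard labels
    return keep, pseudo_labels

# toy binary-classification outputs for 4 unlabeled samples
probs = np.array([
    [0.95, 0.05],
    [0.55, 0.45],
    [0.10, 0.90],
    [0.60, 0.40],
])
keep, labels = select_pseudo_labeled(probs, acc=0.5)  # keep 2 of 4
```

The retained `(sample, pseudo_label)` pairs then feed the local update of Eq. (1), while the discarded high-entropy samples simply sit out that round.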

Federated aggregation

In the aggregation stage, similar to the method in Sect. “Heterogeneous federated learning”, the corresponding parts of all student models are aggregated. Specifically, this consists of the internally weighted aggregation of each student cluster, i.e., homogeneous aggregation, the result of which is \( \{{{{\overline{\mathrm{Students}}} }^1},{{{\overline{\mathrm{Students}}} }^2}, \ldots ,{{{\overline{\mathrm{Students}}} }^n}\}\). Then, heterogeneous aggregation between the student clusters is performed. Finally, we obtain the global student model \({\overline{\mathrm{Students}}}\).

Model fine-tuning

After the global student model is obtained by federated aggregation, to prevent the pseudo-labeled data from causing the model to drift, we imitate a teacher’s guidance and correction of growing students, using the labeled data stored on the server to fine-tune the model. After that, a new global model is acquired, which serves as the teacher model in the next iteration.
The above process is repeated iteratively; thus, the student model grows step by step, performing better and better. Finally, the global model integrates all the features of labeled and unlabeled data.

Algorithm

To explain the methods described in Sect. “The proposed method: Semi-HFL” more clearly, we further present their algorithms, Heterogeneous FL and Semi-HFL, in this section.

Heterogeneous FL

Algorithm 1 illustrates the whole process of federated training of the multi-branch model. The inputs include the number of federated rounds T, the number of local updates E, the number of heterogeneous client clusters n, the number of clients in each cluster, the learning rate \(\eta \), and the data size of each client. The output is the global model. Before training starts, the server divides the global multi-branch model into a series of single-exit submodels, each branch corresponding to one submodel. Since branches are inserted at different points, the corresponding submodels have different complexities; hence, their storage, computing, and communication resource requirements also differ. We assume that the submodel series \(\{ {{\omega ^1}\left( t \right) ,{\omega ^2}\left( t \right) , \ldots ,{\omega ^n}\left( t \right) }\}\) is sorted from shallow to deep. The fifth line of Algorithm 1 distributes submodels of different complexities to the selected client clusters during one iteration. All clients in one cluster share the same model architecture. Each client uses local data to train the received submodel E times (lines 14–20). After all selected clients have completed local training, the models are aggregated (line 11). Specifically, the aggregation process is divided into homogeneous aggregation within clusters (lines 22–25) and heterogeneous aggregation between clusters (lines 26–37). Each client’s model is first aggregated with those of the other clients in its cluster, which hold the same model. This aggregation uses the FedAvg method (line 24) proposed by McMahan et al. [60]; that is, model parameters are averaged weighted by the amount of client data. After each resource-heterogeneous cluster obtains an averaged submodel through the above process, the branches and backbone networks are aggregated between clusters.
For the aggregation of the backbone network, we regard the backbone network as the connection of each part between adjacent branches. Therefore, the backbone network is composed of n parts, which can be expressed as \(\{ {{\theta ^{*(1)}}\left( t \right) ,{\theta ^{*(2)}}\left( t \right) , \ldots ,{\theta ^{*(n)}}\left( t \right) }\}\). In particular, the first part of the backbone network in submodel 1 is \({\theta ^{1(1)}}\left( t \right) \), and the first part of the backbone network in submodel 2 can be expressed as \({\theta ^{2(1)}}\left( t \right) \). Since the depth of each model is different, not every part of the backbone network is included in every submodel. Therefore, we will aggregate each part of the backbone network separately (line 30). For the aggregation of branches, since the submodel corresponding to each branch is only distributed to one type of client cluster during model distribution and each type of branch only appears in one cluster, the aggregation method of the branches is the same as the homogeneous aggregation. Finally, the union of each part of the backbone network and branches constitutes a global multi-branch network (line 32 and line 37).

Semi-HFL

For more general scenarios with limited labeled data, Algorithm 2 extends Algorithm 1 to semi-supervised settings, providing a semi-supervised learning method under the heterogeneous federated learning framework. We assume that there is a small amount of labeled data on the server and a large amount of unlabeled data on the clients. Compared with Algorithm 1, the output of Algorithm 2 remains unchanged, while the input additionally includes the server-side labeled data. Before distributing the heterogeneous models to clients, the server uses the labeled data for pretraining and obtains the accuracy of each branch \(\{ {\mathrm{acc}^{1},\mathrm{acc}^{2}, \ldots ,\mathrm{acc}^{n}}\}\) (line 5). After the models are distributed to clients, the entropy values of all local samples are calculated (line 10), and the number of local training samples is obtained (line 11). Only samples with sufficiently small entropy values are used to train the local model, with the pseudo-labels being the predictions of the model downloaded from the server (line 12). After local training is completed, homogeneous and heterogeneous aggregation are performed as in Algorithm 1. Unlike Algorithm 1, however, to reduce the unreliability of local pseudo-labels, the aggregated model is also fine-tuned with server-side labeled data; the fine-tuning method is the same as the pretraining method (lines 20–26).
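The four Semi-HFL stages can be caricatured end to end with scalar "models"; every update rule and number below is a drastic illustrative stand-in for the corresponding stage, not the paper's actual procedure.

```python
# One-dimensional toy of repeated Semi-HFL rounds: server pretraining,
# client updates toward their (pseudo-labeled) local optima, size-weighted
# aggregation, and server fine-tuning on labeled data.

def server_finetune(model, labeled_mean, lr=0.5):
    # supervised step: pull the model toward the labeled-data optimum
    return model + lr * (labeled_mean - model)

def client_update(model, local_mean, lr=0.5):
    # unsupervised step: pull toward the pseudo-labeled local optimum
    return model + lr * (local_mean - model)

def aggregate(models, sizes):
    return sum(m * s for m, s in zip(models, sizes)) / sum(sizes)

global_model = 0.0
labeled_mean = 1.0                      # server labeled-data optimum
client_means = [0.8, 1.2, 1.0]          # clients' pseudo-label optima
client_sizes = [10, 20, 30]

for _ in range(20):                     # repeated Semi-HFL rounds
    teacher = server_finetune(global_model, labeled_mean)       # pretrain
    students = [client_update(teacher, m) for m in client_means]
    global_model = aggregate(students, client_sizes)            # aggregate
    global_model = server_finetune(global_model, labeled_mean)  # fine-tune
# the iteration settles near the labeled/pseudo-labeled consensus (~1.0)
```

The point of the caricature is the control flow: the fine-tuned global model of one round becomes the teacher of the next, so server labels repeatedly correct drift introduced by client pseudo-labels.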

Convergence analysis

In Eq. (9), the multi-branch large model obtained through heterogeneous aggregation is composed of various branches and a backbone network that is divided into n parts by the branches. Therefore, we focus on the parts that make up the large model, prove the convergence of each part, and finally obtain the convergence of the global model. In general, the convergence analysis can be divided into two parts: the branches and the backbone network. Before the analysis, we make the following assumptions:
Assumption 1
During each round of federated iteration, the number of local training steps for each client remains the same, and the learning rate does not change.
Assumption 2
During a federated learning process, all clients will participate in the training.
Assumption 3
In the multi-branch model, since the number of branch layers is very small compared to the backbone, we assume that the decomposed parts of the backbone network can independently complete training and inference tasks. They are related sequentially: for two adjacent parts, the output of the previous part is the input of the next. During the federated learning process, they are distributed to clients for training.
Assumption 4
The local loss function \({F_j}\) is convex.
Assumption 5
The local loss function \({F_j}\) is M-smooth, which satisfies (i) \({F_j}(y) - {F_j}(x) - \left\langle {\nabla {F_j}(x),y - x} \right\rangle \le \frac{M}{2}\left\| {x - y} \right\| _2^2\); (ii) \(\left\langle {\nabla {F_j}(x) - \nabla {F_j}(y),x - y} \right\rangle \ge \frac{1}{M}\left\| {\nabla {F_j}(x) - \nabla {F_j}(y)} \right\| _2^2\).
Assumption 6
The expectation and variance of the stochastic gradient of each client meet the following conditions:
$$\begin{aligned}&\mathbb {E}\left[ {{G_j}({\theta _j}(t,e))\mid {\theta _j}(t,e)} \right] = \nabla {F_j}\left( {{\theta _j}(t,e)} \right) ,\\&\mathbb {E}\left[ {{{\left\| {{G_j}({\theta _j}(t,e)) - \nabla {F_j}\left( {{\theta _j}(t,e)} \right) } \right\| }^2}\mid {\theta _j}(t,e)} \right] \le {\delta ^2}. \end{aligned}$$
Assumption 7
The L2-norm of the difference between local and global gradients has an upper bound
$$\begin{aligned} {\max _j}\mathop {\sup }\limits _\theta \left\| {\nabla {F_j}\left( {{\theta _j}(t,e)} \right) - \nabla F\left( {{\theta _j}(t,e)} \right) } \right\| \le \beta . \end{aligned}$$
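As a concrete sanity check, Assumption 5(i) can be verified numerically for a simple convex loss. The quadratic example below is ours, not the paper's; it only illustrates what M-smoothness requires of \(F_j\).

```python
import random

# Numeric spot-check of Assumption 5(i) for F(x) = x^2, which is convex
# and M-smooth with M = 2:
#   F(y) - F(x) - <F'(x), y - x> <= (M/2) * |y - x|^2.
def F(x):
    return x * x

def dF(x):
    return 2 * x

M = 2.0
random.seed(0)
for _ in range(1000):
    x, y = random.uniform(-5, 5), random.uniform(-5, 5)
    lhs = F(y) - F(x) - dF(x) * (y - x)
    # For this quadratic the inequality holds with equality.
    assert lhs <= M / 2 * (y - x) ** 2 + 1e-12
print("Assumption 5(i) holds for F(x) = x^2 with M = 2")
```

For a quadratic, the left-hand side equals \((y-x)^2\) exactly, so the bound is tight; general M-smooth losses satisfy it with slack.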
For the convergence analysis of the backbone network, we use mathematical induction, analyzing each part of the backbone from front to back. The entire decomposed backbone network can be expressed as \(\left\{ {{\theta ^{*(1)}}\left( t \right) ,{\theta ^{*(2)}}\left( t \right) , \ldots ,{\theta ^{*(n)}}\left( t \right) } \right\} \), where \({\theta ^{{{*}}(k)}}\left( t \right) \) is the kth part of the decomposed backbone. By induction, we only need to prove the following two points to establish the convergence of the global model:
Theorem 1
\({\theta ^{{{*}}(1)}}\left( t \right) \) is convergent. That is, when \(k=1\), \({\theta ^{{{*}}(k)}}\left( t \right) \) is convergent.
Theorem 2
When \({\theta ^{{{*}}(k)}}\left( t \right) \) is convergent, \({\theta ^{{{*}}(k+1)}}\left( t \right) \) converges, too.
Proof of Theorem 1
According to Assumption 3, \({\theta ^{{{*}}(1)}}\left( t \right) \) is a submodel that can complete training and inference tasks independently. Because it is located at the beginning of the backbone, its input during training is the client data and is not affected by subsequent submodels. The proof of convergence is similar to that of traditional federated learning. Taking FedAvg as an example, when all clients have the same amount of data,
$$\begin{aligned} {{\overline{\theta }} ^{*(1)}}(t + 1)= & {} \frac{1}{L}\sum \limits _{j = 1}^L {\theta _j^{*(1)}(t + 1)}, \end{aligned}$$
(12)
$$\begin{aligned} F({{\overline{\theta }} ^{*(1)}}(t + 1))= & {} \frac{1}{L}\sum \limits _{j = 1}^L {{F_j}(\theta _j^{*(1)}(t + 1))}. \end{aligned}$$
(13)
Therefore, the key to proving convergence lies in showing that the gap between the average loss of local iterations and the global minimum loss (Eq. 14) decreases as the number of iterations grows, that is, that the upper bound \(B_1\) in Eq. 14 decreases as T increases
$$\begin{aligned}&\mathbb {E}\left[ {\frac{1}{{TE}}\sum \limits _{t = 1}^T {\sum \limits _{e = 1}^E {F\left( {{{{\overline{\theta }} }^{*(1)}}(t,e)} \right) } } - F(\theta _\mathrm{best}^{{*(1)}})} \right] \nonumber \\&\qquad \le \mathrm{an \, upper \, bound } \, {{\mathrm{B}}_1}, \end{aligned}$$
(14)
where \(F\left( {{{{\overline{\theta }} }^{*(1)}}(t,e)} \right) \) is the loss function value of the average model obtained at the eth local training step of the tth federated round, and \(\theta _\mathrm{best}^{{*(1)}}\) is the global optimal parameter that minimizes the loss function. To prove the above formula, the following two conditions need to be met:
Lemma 1
(Central learning) In the process of optimization, the model parameters should be continuously optimized, i.e., \(\mathbb {E}\left[ {\frac{1}{E}\sum \limits _{e = 1}^E {F\left( {{{{\overline{\theta }} }^{*(1)}}(t,e)} \right) } - F\left( {\theta _\mathrm{best}^{{*(1)}}} \right) } \right] \) has an upper bound.
Lemma 2
(Distributed learning) The variation range of clients’ model parameters should be bounded, i.e., the variance of model parameters \(\mathbb {E}\left[ {{{\left\| {\theta _j^{*(1)}(t,e) - {{{\overline{\theta }} }^{*(1)}}(t,e)} \right\| }^2}\mid {{{\mathcal {F}}}^{(t,0)}}} \right] \) in the local learning process has an upper bound, where \({{{\mathcal {F}}}^{(t,0)}}\) represents all the historical information before the start of the tth federated round.
Lemma 1 is the condition of central learning, which indicates that each iteration process will make the model parameters closer to the optimal model parameters. Based on Lemma 1, Lemma 2 focuses on the characteristics of distributed learning, limiting the variation range of each client model parameter. Through analysis, we find that when the learning rate \(\eta \le \frac{1}{{4M}}\)
$$\begin{aligned} \mathbb {E}&\left[ {\frac{1}{E}\sum \limits _{e = 1}^E {F\left( {{{{\overline{\theta }} }^{*(1)}}(t,e)} \right) } - F\left( {\theta _\mathrm{best}^{{*(1)}}} \right) \mid {{{\mathcal {F}}}^{(t,0)}}} \right] \nonumber \\&\quad \le \frac{1}{{2\eta E}}\left( {\left\| {{{{\overline{\theta }} }^{*(1)}}(t,0) - \theta _\mathrm{best}^{{*(1)}}} \right\| ^2 - \mathbb {E}\left[ {\left\| {{{{\overline{\theta }} }^{*(1)}}(t,E) - \theta _\mathrm{best}^{{*(1)}}} \right\| ^2 \mid {{{\mathcal {F}}}^{(t,0)}}} \right] } \right) \nonumber \\&\quad \quad + \frac{{\eta {\delta ^2}}}{L} + \frac{M}{{LE}}\sum \limits _{j = 1}^L {\sum \limits _{e=1}^E } \mathbb {E}\left[ {\left\| {\theta _j^{*(1)}(t,e) - {{{\overline{\theta }} }^{*(1)}}(t,e)} \right\| ^2 \mid {{{\mathcal {F}}}^{(t,0)}}} \right] ,\end{aligned}$$
(15)
$$\begin{aligned} \mathbb {E}&\left[ {{{\left\| {\theta _j^{*(1)}(t,e) - {{{\overline{\theta }} }^{*(1)}}(t,e)} \right\| }^2}\mid {{{\mathcal {F}}}^{(t,0)}}} \right] \nonumber \\&\quad \le 18{E^2}{\eta ^2}{\beta ^2} + 4E{\eta ^2}{\delta ^2}. \end{aligned}$$
(16)
The specific proofs of the above two formulas are shown in the Appendix. Combining Eqs. 15 and 16, we can get the following conclusion:
$$\begin{aligned}&\mathbb {E}\left[ {\frac{1}{{TE}}\sum \limits _{t = 1}^T {\sum \limits _{e = 1}^E {F\left( {{{{\overline{\theta }} }^{*(1)}}(t,e)} \right) } } - F(\theta _\mathrm{best}^{{*(1)}})} \right] \nonumber \\&\quad \le \frac{{{Q^2}}}{{2ET\eta }} + \frac{{\eta {\delta ^2}}}{L} + 4EM{\eta ^2}{\delta ^2}\nonumber \\&\qquad + 18M{E^2}{\eta ^2}{\beta ^2} \left( {\eta \le \frac{1}{{4M}}} \right) , \end{aligned}$$
(17)
where \(Q = \left\| {{\theta ^{*(1)}}(0,0) - \theta _\mathrm{best}^{{*(1)}}} \right\| \). We can get the conclusion from Eq. 17 that as the number of iterations T increases, the upper bound of the gap between average local loss and the minimum loss continues to narrow, indicating that the model converges under the framework of federated learning.
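The behavior stated in Eq. 17 can be observed in a toy simulation of the averaging scheme of Eq. 12: L clients run E local gradient steps on convex quadratic losses, the server averages, and the optimality gap narrows over rounds. The scalar losses and all constants below are illustrative choices, not the paper's models.

```python
import random

# Toy FedAvg-style simulation: client j minimises F_j(w) = (w - c_j)^2,
# so the global optimum of F = (1/L) sum F_j is the mean of the c_j.
random.seed(1)
L, E, eta = 5, 3, 0.05
centers = [random.uniform(-1, 1) for _ in range(L)]
w_best = sum(centers) / L               # global optimum

def global_loss(w):
    return sum((w - c) ** 2 for c in centers) / L

w = 5.0                                  # shared initial model
gaps = []
for t in range(50):                      # federated rounds (T)
    locals_ = []
    for c in centers:
        wj = w
        for _ in range(E):               # local training steps
            wj -= eta * 2 * (wj - c)     # gradient of (wj - c)^2
        locals_.append(wj)
    w = sum(locals_) / L                 # averaging step of Eq. 12
    gaps.append(global_loss(w) - global_loss(w_best))

assert gaps[-1] < gaps[0]                # the optimality gap narrows with T
print(f"gap after round 1: {gaps[0]:.4f}, after round 50: {gaps[-1]:.6f}")
```

Because each local step contracts \(w_j\) toward \(c_j\) and averaging recenters on the mean, the gap to \(F(\theta_\mathrm{best})\) decays geometrically here, consistent with the bound shrinking as T grows.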
Proof of Theorem 2
Further, we analyze the convergence of \({\theta ^{{{*}}(k{\mathrm{+}}1)}}\left( t\right) \) when \({\theta ^{{{*}}(k)}}\left( t\right) \) is convergent. According to Assumption 3, the input of \({\theta ^{{{*}}(k{\mathrm{+}}1)}}\left( t\right) \) is the output of \({\theta ^{{{*}}(k)}}\left( t\right) \), i.e., \(input\left( {{\theta ^{{{*}}(k + 1)}}\left( {t,0} \right) } \right) = output\left( {{\theta ^{{{*}}(k)}}\left( {t,E} \right) } \right) \). If t is large enough, \({\theta ^{{{*}}(k)}}\left( t\right) \) converges; when its input is given, its output converges to a specific value, and thus the input of \({\theta ^{{{*}}(k+1)}}\left( t\right) \) is determined. If \({\theta ^{{{*}}(k+1)}}\left( t\right) \) is convergent, it should meet the following condition:
$$\begin{aligned}&\mathbb {E}\left[ {\frac{1}{{TE}}\sum \limits _{t = 1}^T {\sum \limits _{e = 1}^E {F\left( {{{{\overline{\theta }} }^{*(k + 1)}}(t,e)} \right) }}- F(\theta _\mathrm{best}^{{*(k + 1)}})} \right] \nonumber \\&\quad \quad \le {\mathrm{an \, upper \, bound}}\,{{\mathrm{B}}_{k + 1}}, \end{aligned}$$
(18)
where \(B_{k+1}\) should decrease as T increases. When the input of \({\theta ^{{{*}}(k+1)}}\left( t\right) \) is determined, by an analysis similar to that of \({\theta ^{{{*}}(1)}}\left( t\right) \), we can show that \({\theta ^{{{*}}(k+1)}}\left( t\right) \) is convergent. Therefore, the following results are concluded: (i) \({\theta ^{{{*}}(1)}}\left( t\right) \) is convergent, that is, when \(k=1\), \({\theta ^{{{*}}(k)}}\left( t\right) \) converges; (ii) when \({\theta ^{{{*}}(k)}}\left( t\right) \) converges, \({\theta ^{{{*}}(k+1)}}\left( t\right) \) converges, too. By induction, every part of the backbone network \(\left\{ {{\theta ^{*(1)}}\left( t \right) ,{\theta ^{*(2)}}\left( t \right) , \ldots ,{\theta ^{*(n)}}\left( t \right) } \right\} \) converges; thus, the entire backbone network converges. \(\square \)
In addition to the backbone network, since a single-type branch network exists only under one client cluster and participates only in homogeneous aggregation during federated learning, its convergence analysis is consistent with the traditional federated learning process; please refer to the analysis of \({\theta ^{{{*}}(1)}}\left( t\right) \).

Experimental verification

In this paper, we use the MNIST, Cifar10, MR [61], and Shakespeare [62] datasets to verify the effectiveness of Semi-HFL on image classification, text classification, and next-word prediction tasks. Specifically, this section is divided into four parts: Semi-HFL feasibility verification, resource overhead study, ablation experiment, and extended experiment, corresponding to Sects. “Semi-HFL feasibility verification”, “Resource overhead study”, “Ablation experiment”, and “Extended experiment”, respectively. In the Semi-HFL feasibility verification part, we conduct separate experiments on the two-level and multi-level heterogeneity cases under both the independent and identically distributed (IID) and non-independent and identically distributed (non-IID) settings. Through Sect. “Semi-HFL feasibility verification”, we examine whether Semi-HFL can maintain accuracy compared to other methods, and how different heterogeneous cases affect the final performance of the model. In Sect. “Resource overhead study”, we consider only the IID distribution and further measure the overhead of Semi-HFL in terms of storage, computing, and communication resources. After that, in the ablation experiments of Sect. “Ablation experiment”, we explore the necessity of adding a regularization term to the client loss function. Finally, to test the generalization capability of Semi-HFL, we run extra multi-task experiments in Sect. “Extended experiment”. The main comparison methods are FedAvg [60], FedProx [6], and FedProto [63], where FedAvg is a homogeneous method, while FedProx and FedProto are heterogeneous methods.
Table 1
The main experimental results

| Dataset | Indicator | 1+2 | 1+3 | 2+3 | 1+2+3 | Avg2 | Avg3 | Prox2 | Prox3 | Proto-2 | Proto-3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MNIST | Size | 195K | 1210K | 1230K | 851K | 280K | 2450K | 280K | 2450K | 4192K | 4203K |
| | FLOPS | 2281K | 2563K | 3675K | 2817K | 3689K | 4229K | 3689K | 4229K | 5067K | 5127K |
| | Param | 45K | 303K | 308K | 212K | 66K | 617K | 66K | 617K | 197K | 197K |
| | iid (%) | 98.67 | 98.55 | 98.72 | 98.15 | 97.32 | 97.00 | 97.37 | 97.57 | 98.61 | 98.72 |
| | noniid (%) | 98.48 | 98.59 | 98.64 | 98.37 | 97.09 | 97.25 | 97.34 | 97.26 | 98.25 | 97.78 |
| Cifar10 | Size | 63 M | 206 M | 244 M | 174 M | 107 M | 427 M | 107 M | 427 M | 427 M | 427 M |
| | FLOPS | 3320 M | 3900 M | 4560 M | 3874 M | 4224 M | 5567 M | 4224 M | 5567 M | 374 M | 384 M |
| | Param | 16 M | 54 M | 64 M | 45 M | 28 M | 112 M | 28 M | 112 M | 1180K | 1180K |
| | iid (%) | 79.18 | 78.06 | 75.96 | 75.04 | 66.05 | 65.51 | 67.22 | 64.68 | 61.44 | 59.17 |
| | noniid (%) | 78.52 | 75.55 | 72.58 | 74.89 | 66.37 | 65.11 | 65.21 | 62.91 | 30.41 | 29.61 |
| Shakespeare | Size | 34.07 M | 34.36 M | 37.4 M | 33.82 M | 37.10 M | 42.20 M | 37.10 M | 42.20 M | 34.14 M | 36.71 M |
| | FLOPS | 601.32 M | 598.22 M | 605.36 M | 598.08 M | 643.97 M | 645.3 M | 643.97 M | 645.3 M | 643.20 M | 643.87 M |
| | Param | 8.83 M | 8.91 M | 9.59 M | 8.72 M | 9.49 M | 11.05 M | 9.49 M | 11.05 M | 1867K | 1867K |
| | iid (%) | – | – | – | – | – | – | – | – | – | – |
| | noniid (%) | 36.39 | 36.78 | 38.47 | 35.36 | 32.57 | 31.27 | 26.97 | 26.54 | 38.88 | 39.65 |
| MR | Size | 93 M | – | – | – | 102 M | – | 102 M | – | 101 M | – |
| | FLOPS | 82 M | – | – | – | 89 M | – | 89 M | – | 705 M | – |
| | Param | 1926K | – | – | – | 2692K | – | 2692K | – | 22K | – |
| | iid (%) | 61.82 | – | – | – | 59.75 | – | 59.56 | – | 59.87 | – |
| | noniid (%) | 61.07 | – | – | – | 57.97 | – | 59.19 | – | 58.78 | – |

Due to space limitations, the results of Avg1 and Prox1 are not listed. Shakespeare is evaluated only in the non-IID setting. Since the MR model contains only one branch, only its single two-level heterogeneous model and the corresponding homogeneous baselines are reported
Regarding the processing method for the non-IID distribution, we first sort all samples according to their labels and then divide them equally, in order, into a predefined number of packages. Finally, each client selects the same number of packages as local samples. Specifically, in MNIST, in addition to the 6000 labeled samples on the server, we divide the remaining 54,000 unlabeled samples into 250 packages in label order, and each client randomly picks 5 of them as local training data. Similarly, we divide the client training data of Cifar10 into 500 packages and MR into 100, with each client picking 10 and 2 packages, respectively, forming non-IID distributions. For the non-IID setting of the Shakespeare dataset, borrowing the setting of [62], each device is equipped with the text data of only one role in the works of William Shakespeare. The models in this paper include both CNN and RNN models. In the image classification task, the MNIST dataset adopts the Lenet model, and the Cifar10 dataset adopts the Resnet-18 model. In the text classification task, the model for MR is slightly modified from the model in [61]. The Shakespeare dataset uses a long short-term memory (LSTM) network, a variant of the model in [62], for next-word prediction. All the model structures are shown in Fig. 3. The optimizer used in the experiments is SGD; the learning rate is 0.1 for Cifar10 and MR, 0.05 for MNIST, and 0.5 for Shakespeare. In total, 50 clients are involved in the FL framework, and in each round, 20% of them are selected by the server to train collaboratively. Table 1 lists the main experimental results of this paper.
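The label-sorted package partition described above can be sketched as follows. The sample counts are scaled down from the paper's MNIST setting (54,000 samples, 250 packages, 5 packages per client) purely for illustration, and all helper names are ours.

```python
import random

# Non-IID partition: sort samples by label, slice them in order into
# fixed-size "packages", and let each client draw a few packages.
random.seed(0)
samples = [(i, i % 10) for i in range(1000)]        # (sample_id, label)
samples.sort(key=lambda s: s[1])                    # sort by label

n_packages, per_client = 50, 5
size = len(samples) // n_packages                    # 20 samples/package
packages = [samples[k * size:(k + 1) * size] for k in range(n_packages)]

def client_data(n_clients):
    pool = list(range(n_packages))
    random.shuffle(pool)                             # random package draw
    return [sum((packages[p]
                 for p in pool[c * per_client:(c + 1) * per_client]), [])
            for c in range(n_clients)]

clients = client_data(10)
labels_on_client0 = {label for _, label in clients[0]}
print(len(clients[0]), sorted(labels_on_client0))   # 100 samples, few labels
```

Because the packages are cut from label-sorted data, each client ends up with at most `per_client` distinct labels, producing the intended label skew.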

Semi-HFL feasibility verification

In this section, we insert branches at different positions of the model to form multi-level heterogeneity, exploring the effectiveness of Semi-HFL in different heterogeneous cases. The evaluation indicator is model accuracy. In the experiments, we randomly select 10% of the training data as the labeled data on the server and distribute the remaining 90% as unlabeled data to the clients. Considering that in real distributed scenarios datasets may show either of two distributions between clients, i.e., IID and non-IID, we involve both distributions in the exploration of the MNIST, Cifar10, and MR datasets. Note that since Shakespeare is usually considered non-independently and identically distributed, it is tested only in the non-IID case. To ensure the fairness of the comparative experiments, we adopt the proposed semi-supervised learning method in the benchmarks.

Two-level heterogeneity

First, we explore the effectiveness of two-level federated heterogeneity, which means inserting one branch in the middle of the model. After the model is split at the branch, two models with different computational complexities are formed, both of which can independently complete training and inference tasks. For the insertion position of the early exit branch, we try two different positions to form two kinds of two-level heterogeneity, so that we can test the effectiveness of Semi-HFL for heterogeneous models formed by different insertion positions. The models corresponding to MNIST, Cifar10, and Shakespeare are shown in Fig. 3, where the different branch positions correspond to branch 1 and branch 2. The two-level heterogeneous model thus has two cases: one is composed of branch 1 and branch 3 (represented by “1+3”); the other is composed of branch 2 and branch 3 (represented by “2+3”). The MR model is also shown in Fig. 3. Since the MR model is small, we insert only one branch into it, forming only one kind of two-level heterogeneous model. The experimental results on the image and text datasets are shown in Figs. 4 and 5, respectively.
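A minimal sketch of such a two-level split follows. The placeholder callables stand in for real layers, and the split point is an assumed example; the point is only that the branch yields a shallow submodel for weak clients and a deep one for strong clients.

```python
# Splitting a backbone at an early-exit branch: blocks up to the branch
# plus its exit head form the shallow submodel; the full backbone plus
# the final head forms the deep submodel.
def make_block(k):
    return lambda x: x + k           # placeholder for a conv/RNN block

backbone = [make_block(k) for k in (1, 2, 3, 4)]
branch_head = lambda x: ("early_exit", x)
final_head = lambda x: ("final_exit", x)

def run(blocks, x):
    for b in blocks:                 # sequential input/output relationship
        x = b(x)
    return x

def split_model(backbone, split_at):
    shallow = backbone[:split_at]
    return (lambda x: branch_head(run(shallow, x)),    # for weak clients
            lambda x: final_head(run(backbone, x)))    # for strong clients

weak, strong = split_model(backbone, split_at=2)
print(weak(0), strong(0))   # ('early_exit', 3) ('final_exit', 10)
```

Moving `split_at` changes which blocks the weak clients train, which is exactly the design choice the “1+3” versus “2+3” comparison probes.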
In the figures, “1+3” and “2+3” represent the two kinds of two-level heterogeneous models, composed of branch 1 and branch 3, and of branch 2 and branch 3, respectively. “Avg1”, “Avg2”, and “Avg3” represent homogeneous learning frameworks whose models are composed only of branch 1, branch 2, and branch 3, respectively, aggregated with FedAvg. Similarly, FedProx is adopted in “Prox1”, “Prox2”, and “Prox3”, whose models mirror those of “Avg1”, “Avg2”, and “Avg3”. “Proto-2” denotes two-level heterogeneous FedProto, i.e., each client chooses between two kinds of models of different sizes. The results in Figs. 4 and 5 show that, regardless of dataset or distribution, when the same semi-supervised learning method is used, the test accuracy obtained by Semi-HFL is never lower than the others’ and is, on average, 1 percentage point higher on the MNIST dataset, 10 percentage points higher on Cifar10, 1–5 percentage points higher on MR, and up to 5 percentage points higher on Shakespeare. This is because the heterogeneous federated learning method proposed in this paper divides the global model into submodels of different depths. In local learning, each submodel searches for its optimal parameters without considering the other parts of the network, which reduces the coupling between the parts of the model and the constraints between parameters during updates, allowing parameter optimization to proceed in a larger search space. In contrast, in methods such as FedAvg, FedProx, and FedProto, since the vertical structure of all client models is the same, all parameters are updated in the direction that maximizes the accuracy of the last exit, a collaborative optimization process. Greater constraints and a smaller search space make the final model performance inferior to Semi-HFL’s.
Therefore, Figs. 4 and 5 demonstrate, to some degree, the feasibility and generalization of Semi-HFL across tasks on image and text datasets. Additionally, the final convergence value of “1+3” is significantly better than that of “2+3” in Fig. 4c, d. This is because, on Cifar10, the model corresponding to branch 1 performs better than branch 2, which can be seen in the comparison between “Avg1” and “Avg2”, or between “Prox1” and “Prox2”, in the figures. Hence, different branch insertion locations yield different training results. In general, if a branch model performs better under the homogeneous training method, the heterogeneous model containing that branch is correspondingly better.

Multi-level heterogeneity

To further verify the feasibility of Semi-HFL in more complex heterogeneous situations, we increase the number of branches by inserting branches at branch 1 and branch 2 at the same time to form a multi-level heterogeneous model (represented by “1+2+3”). Since the MR model adopted in this paper is small, we only explore the multi-level heterogeneous situation for the MNIST, Cifar10, and Shakespeare datasets in Figs. 6, 7, and 8.
Figures 6a, b, 7a, b, and 8a show the results of MNIST and Cifar10 in the IID and non-IID distributions, and Shakespeare's in the non-IID distribution. When the heterogeneous situation is more complex and there are more types of submodels, the models trained by Semi-HFL still have obvious advantages, being about 1 percentage point higher on average on MNIST and 10 percentage points higher on Cifar10. On Shakespeare, Semi-HFL is still 5 percentage points higher on average than FedAvg and FedProx. In addition, we also show the local accuracy trend of all clients before federated aggregation under different heterogeneous cases in Figs. 6c, d, 7c, d, and 8b. It can be found that within the experimental ranges, increasing the degree of heterogeneity does not significantly affect model performance, that is, the final convergence value does not change significantly. However, the higher the degree of heterogeneity, the slower the convergence speed, and the greater the difference between local accuracies.

Resource overhead study

The initial motivation for introducing heterogeneous federated learning is to meet the heterogeneous needs of client storage, computing, communication, and other resources. Therefore, we will verify whether Semi-HFL consumes fewer resources than others. Since the data distributions do not directly affect the resource overhead, this section assumes that the data distribution is IID (Shakespeare is non-IID), and calculates the storage, computing, and communication resources consumed by four kinds of datasets under different computing methods. The storage, computing, and communication resources are measured by model size, FLOPS, and parameters, respectively. Besides, the resource overhead of FedAvg is close to FedProx, and we only compare Semi-HFL with FedAvg and FedProto. Finally, the test accuracy vs. resource overhead scatter plot (Fig. 9) is obtained. The dots in the figure represent the resource overhead of all clients participating in a certain round of training under Semi-HFL, yellow stars are the average resource cost and accuracy of FedProto, and red stars are FedAvg’s. It is worth noting that in the scatter plot of computing resource overhead, since each client was assigned the same number of samples in the experiment, we only calculate the computing resource cost of a single sample for each participating client.
It can be clearly seen from the figure that, whether on MNIST, Cifar10, MR, or Shakespeare, the heterogeneous training method not only achieves higher model accuracies than the other training methods but also significantly reduces the storage, computing, and communication resource overhead. This is because, in Semi-HFL, smaller submodels are trained and transmitted, reducing the overall resource overhead. Meanwhile, it is worth noting that the communication resource overhead of FedProto is smaller than the others’ in most cases; this is because protos, rather than model parameters, are transmitted between clients and the server. In addition, on the MNIST, Cifar10, and Shakespeare datasets, it can also be found that at the same heterogeneity level, the larger the submodels, the more resources are consumed overall; for example, the overall consumption of “2+3” is higher than that of “1+3”. Across different heterogeneity levels, the larger the proportion of shallow models, the lower the resource overhead; for example, “1+2+3” consumes fewer resources than “2+3” but more than “1+2”.
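As a rough illustration of why shallower submodels shrink all three kinds of overhead, the accounting below counts parameters and per-sample FLOPs for a toy fully connected backbone. The layer shapes are invented and do not correspond to the models or numbers in Table 1.

```python
# A dense layer (n_in -> n_out) has n_in*n_out + n_out parameters and
# roughly 2*n_in*n_out FLOPs per sample (multiply-accumulate counted as 2).
layers = [(784, 256), (256, 128), (128, 64), (64, 10)]  # toy backbone

def cost(layers):
    params = sum(i * o + o for i, o in layers)
    flops = sum(2 * i * o for i, o in layers)
    return params, flops

full = cost(layers)
# Shallow submodel: first two blocks plus a small early-exit head.
shallow = cost(layers[:2] + [(128, 10)])
print("full   :", full)      # (242762, 484608)
print("shallow:", shallow)   # (235146, 469504)
assert shallow[0] < full[0] and shallow[1] < full[1]
```

Model size and communication volume scale with the parameter count and compute with the FLOP count, so assigning the shallow submodel to weak clients lowers all three at once.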

Ablation experiment

In the Semi-HFL method proposed in this paper, since the locally trained submodel is part of the global model, we add a regularization term to the local loss functions to prevent the gap between the submodel and the global model from becoming too large, achieving a balance between the local models and the global model. To demonstrate that this approach is effective, we conduct ablation experiments in this section. For brevity, we pick one image dataset and one text dataset for validation.
We take the “1+3” heterogeneous situation as an example and obtain the experimental results shown in Fig. 10. Each subfigure corresponds to the local training results of the clients with and without the regularization term under different distributions; the green line indicates training with the regularization term, while the blue one indicates training without it. It can be seen that on Cifar10 (Fig. 10a), the clients with the regularization term perform significantly better than those without it in both the IID and non-IID distributions. Especially when the client data show a non-IID distribution, the gap between the two is more obvious. This is because local data skew (i.e., non-IID data) can easily lead to overfitting of locally trained models, resulting in unsatisfactory performance on the test dataset. On the text dataset MR (Fig. 10b), the gap between the green and blue lines is small, indicating that adding the regularization term does not deteriorate model performance. Therefore, based on the above test results, we believe it is necessary to add a regularization term to the local loss function.
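A regularized local objective of this kind can be sketched as a proximal term added to the task loss, in the style of FedProx. The paper does not state the exact form of its term here, so the coefficient `mu` and the toy task loss below are illustrative assumptions.

```python
# Local objective = task loss + (mu/2) * ||w - w_global||^2, where the
# proximal term keeps the locally trained submodel close to the
# aggregated global model.
def local_objective(w, w_global, task_loss, mu=0.1):
    prox = 0.5 * mu * sum((wi - gi) ** 2 for wi, gi in zip(w, w_global))
    return task_loss(w) + prox

task = lambda w: (w[0] - 1.0) ** 2            # toy per-client loss
w, w_glob = [3.0, 2.0], [0.0, 0.0]
print(local_objective(w, w_glob, task))       # 4.0 + 0.05*(9+4) = 4.65
```

When the local parameters drift far from the global model, the proximal penalty grows and pulls the update back, which is the balancing effect the ablation measures.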

Extended experiment

To explore the generalization capability of Semi-HFL, we extend the above single-task verification to multi-task experiments under heterogeneous FL orchestration. In detail, there are two kinds of tasks in the FL framework: one is recognizing handwritten digits based on the MNIST dataset, and the other is recognizing costumes based on the FashionMNIST dataset. In the experimental settings, the number of clients is 100, 50 of which are requested to perform the handwritten digit recognition task, while the remaining 50 perform the costume recognition task. Each client has only the dataset related to its task. Both recognition tasks can be computed by a global aggregated model, which is a variant of Lenet and is shown on the left of Fig. 6. Since both MNIST and FashionMNIST contain 60,000 training images and 10,000 test images, we randomly choose 6000 images from each dataset as labeled data on the server; the remaining 54,000 MNIST training images are evenly distributed to the clients performing handwritten digit recognition, and the remaining 54,000 FashionMNIST training images are distributed in the same way. The test dataset is composed of the 20,000 test images from MNIST and FashionMNIST. We also consider both the IID and non-IID distributions in multi-task federated learning; when the data distribution is non-IID, the data partition method is similar to that of MNIST in the single-task experiments above. The learning rate, optimizer, and other settings also follow the single-task experiments.
Figure 11 compares the performance of the global models trained by Semi-HFL and the other methods. It shows that Semi-HFL has clear advantages over the other methods regardless of the distribution. Besides, we can also find that the overall performances of FedProx and FedProto are better than FedAvg’s. This is easy to understand: multi-task FL is essentially a kind of task-based heterogeneous FL. FedProx was designed to address heterogeneity by adding a regularization term to the local loss function of FedAvg, and FedProto tries to meet heterogeneous requirements by changing the structures of specific model layers. Therefore, the declining trend of the FedAvg curve is caused by overfitting, which is also essentially a heterogeneity problem; that is exactly why we add the regularization term in Semi-HFL. So far, we can conclude that the Semi-HFL proposed in this paper is effective for at least single-task and multi-task FL.

Conclusions

This paper proposes Semi-HFL, a new heterogeneous federated learning framework based on semi-supervised learning, addressing the resource heterogeneity and unlabeled data challenges in federated learning and inspired by multi-branch fast-inference models. Specifically, by inserting early exit branches in the middle of the model, the globally unified model of the traditional federated learning framework is split into submodels adapted to diverse client computing, communication, and storage resources. In this process, a semi-supervised federated learning technique is designed in consideration of the limited availability of labeled data, and a regularization term is introduced into the loss function to address the overfitting problem of local models. On the one hand, the framework caters to clients’ personalized needs and provides a novel approach for solving the heterogeneity problem in federated learning systems. On the other hand, unlabeled data are fully utilized, which greatly saves labeling costs. Through image classification, text classification, and next-word prediction experiments, it is shown that regardless of the data distribution, the accuracy of the model trained by Semi-HFL is higher than that of homogeneous and heterogeneous baselines while consuming fewer resources. In addition, as the degree of heterogeneity increases, the convergence speed slows down and the variance of client accuracies grows.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 62002369 and 61872378, and in part by the Scientific Research Project of National University of Defense Technology under Grant ZK19-03.

Declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix

Proof of Lemma 1

According to Assumption 5, the local loss function \(F_j\) is M-smooth, and we have the following:
$$\begin{aligned}&F_{j}\left( \bar{\theta }^{*(1)}(t, e+1)\right) \le F_{j}\left( \theta _{j}^{*(1)}(t, e)\right) \nonumber \\&\quad +\left\langle \nabla F_{j}\left( \theta _{j}^{*(1)}(t, e)\right) , \bar{\theta }^{*(1)}(t, e+1)-\theta _{j}^{*(1)}(t, e)\right\rangle \nonumber \\&\quad +\frac{M}{2}\left\| \bar{\theta }^{*(1)}(t, e+1)-\theta _{j}^{*(1)}(t, e)\right\| ^{2}. \end{aligned}$$
(A1)
Because \(F_j\) is also convex, then we can get
$$\begin{aligned}&F_{j}\left( \bar{\theta }^{*(1)}(t, e+1)\right) \le F_{j}\left( \theta _{\mathrm{best}}^{*(1)}\right) \nonumber \\&\qquad +\left\langle \nabla F_{j}\left( \theta _{j}^{*(1)}(t, e)\right) , \bar{\theta }^{*(1)}(t, e+1)-\theta _{\mathrm{best}}^{*(1)}\right\rangle \nonumber \\&\qquad +\frac{M}{2}\left\| \bar{\theta }^{*(1)}(t, e+1)-\theta _{j}^{*(1)}(t, e)\right\| ^{2} \nonumber \\&\quad \le F_{j}\left( \theta _{\mathrm{best}}^{*(1)}\right) +M\left\| \theta _{j}^{*(1)}(t, e)-\bar{\theta }^{*(1)}(t, e)\right\| ^{2}\nonumber \\&\qquad +M\left\| \bar{\theta }^{*(1)}(t, e+1)-\bar{\theta }^{*(1)}(t, e)\right\| ^{2} \nonumber \\&\qquad +\left\langle \nabla F_{j}\left( \theta _{j}^{*(1)}(t, e)\right) , \bar{\theta }^{*(1)}(t, e+1)-\theta _{\mathrm{best}}^{*(1)}\right\rangle . \end{aligned}$$
(A2)
Since \({{\bar{\theta }} ^{*(1)}}(t,e + 1) = {{\bar{\theta }} ^{*(1)}}(t,e) - \eta \frac{1}{L}\sum \limits _{j = 1}^L {{G_j}(\theta _j^{*(1)}(t,e))}\), the following holds:
$$\begin{aligned}&\frac{1}{L} \sum _{j=1}^{L}\left\langle G_{j}\left( \theta _{j}^{*(1)}(t, e)\right) , \bar{\theta }^{*(1)}(t, e+1)-\theta _{best}^{*(1)}\right\rangle \nonumber \\&\quad =-\frac{1}{\eta }\left\langle \bar{\theta }^{*(1)}(t, e+1)-\bar{\theta }^{*(1)}(t, e), \bar{\theta }^{*(1)}(t, e+1)-\theta _{best}^{*(1)}\right\rangle \nonumber \\&\quad =-\frac{1}{2 \eta }\left( \left\| \bar{\theta }^{*(1)}(t, e+1)-\bar{\theta }^{*(1)}(t, e)\right\| ^{2}+\left\| \bar{\theta }^{*(1)}(t, e+1)-\theta _{best}^{*(1)}\right\| ^{2}-\left\| \bar{\theta }^{*(1)}(t, e)-\theta _{best}^{*(1)}\right\| ^{2}\right) \quad \text {(polarization identity)}. \end{aligned}$$
(A3)
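The last step of Eq. A3 is the polarization identity for inner products, stated here explicitly for completeness. For any vectors \(a\), \(b\), and \(c\) (here \(a=\bar{\theta }^{*(1)}(t,e+1)\), \(b=\bar{\theta }^{*(1)}(t,e)\), and \(c=\theta _{best}^{*(1)}\)):
$$\begin{aligned} \langle a-b, a-c\rangle =\frac{1}{2}\left( \Vert a-b\Vert ^{2}+\Vert a-c\Vert ^{2}-\Vert b-c\Vert ^{2}\right) , \end{aligned}$$
which can be verified by expanding the squared norms on the right-hand side.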
Combining Eqs. A2 and A3 yields
$$\begin{aligned}&F\left( \bar{\theta }^{*(1)}(t, e+1)\right) -F\left( \theta _{best}^{*(1)}\right) \nonumber \\&\quad =\frac{1}{L} \sum _{j=1}^{L}\left( F_{j}\left( \bar{\theta }^{*(1)}(t, e+1)\right) -F\left( \theta _{best}^{*(1)}\right) \right) \nonumber \\&\quad \le \frac{1}{L} \sum _{j=1}^{L}\left\langle \nabla F_{j}\left( \theta _{j}^{*(1)}(t, e)\right) -G_{j}\left( \theta _{j}^{*(1)}(t, e)\right) , \bar{\theta }^{*(1)}(t, e+1)-\theta _{best}^{*(1)}\right\rangle \nonumber \\&\qquad +M\left\| \bar{\theta }^{*(1)}(t, e+1)-\bar{\theta }^{*(1)}(t, e)\right\| ^{2}+\frac{M}{L} \sum _{j=1}^{L}\left\| \theta _{j}^{*(1)}(t, e)-\bar{\theta }^{*(1)}(t, e)\right\| ^{2} \nonumber \\&\qquad +\frac{1}{2 \eta }\left( \left\| \bar{\theta }^{*(1)}(t, e)-\theta _{best}^{*(1)}\right\| ^{2}-\left\| \bar{\theta }^{*(1)}(t, e+1)-\theta _{best}^{*(1)}\right\| ^{2}-\left\| \bar{\theta }^{*(1)}(t, e+1)-\bar{\theta }^{*(1)}(t, e)\right\| ^{2}\right) . \end{aligned}$$
(A4)
Because \(\mathbb {E}\left[ {\nabla {F_j}\left( {\theta _j^{*(1)}(t,e)} \right) \!-\! {G_j}\left( {\theta _j^{*(1)}(t,e)} \right) \mid {{{\mathcal {F}}}^{(t,e)}}} \right] = 0\), we obtain
$$\begin{aligned}&\mathbb {E}\left[ \frac{1}{L} \sum _{j=1}^{L}\left\langle \nabla F_{j}\left( \theta _{j}^{*(1)}(t, e)\right) -G_{j}\left( \theta _{j}^{*(1)}(t, e)\right) ,\right. \right. \nonumber \\&\qquad \left. \left. \bar{\theta }^{*(1)}(t, e+1)-\theta _{b e s t}^{*(1)}\right\rangle \mid \mathcal {F}^{(t, e)}\right] \nonumber \\&\quad = \mathbb {E}\left[ \frac{1}{L} \sum _{j=1}^{L}\left\langle \nabla F_{j}\left( \theta _{j}^{*(1)}(t, e)\right) -G_{j}\left( \theta _{j}^{*(1)}(t, e)\right) ,\right. \right. \nonumber \\&\qquad \left. \left. \bar{\theta }^{*(1)}(t, e+1)-\bar{\theta }^{*(1)}(t, e)\right\rangle \mid \mathcal {F}^{(t, e)}\right] \nonumber \\&\quad \le \eta \cdot \mathbb {E}\left[ \left\| \frac{1}{L} \sum _{j=1}^{L}\left( \nabla F_{j}\left( \theta _{j}^{*(1)}(t, e)\right) \right. \right. \right. \nonumber \\&\qquad \left. \left. \left. -G_{j}\left( \theta _{j}^{*(1)}(t, e)\right) \right) \right\| ^{2} \mid \mathcal {F}^{(t, e)}\right] \nonumber \\&\qquad +\frac{1}{4 \eta } \cdot \mathbb {E}\left[ \left\| \bar{\theta }^{*(1)}(t, e+1)-\bar{\theta }^{*(1)}(t, e)\right\| ^{2} \mid \mathcal {F}^{(t, e)}\right] \nonumber \\&\qquad (\text {Mean inequality}) \nonumber \\&\quad \le \frac{\eta \delta ^{2}}{L}+\frac{1}{4 \eta } \cdot \mathbb {E}\left[ \left\| \bar{\theta }^{*(1)}(t, e+1)-\bar{\theta }^{*(1)}(t, e)\right\| ^{2} \mid \mathcal {F}^{(t, e)}\right] . \nonumber \\ \end{aligned}$$
(A5)
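The mean inequality invoked in Eq. A5 is the Young-type bound for inner products: for any vectors \(x, y\) and any \(a>0\),
$$\begin{aligned} \langle x, y\rangle \le a\Vert x\Vert ^{2}+\frac{1}{4 a}\Vert y\Vert ^{2}, \end{aligned}$$
which follows from \(\left( \sqrt{a}\Vert x\Vert -\frac{\Vert y\Vert }{2\sqrt{a}}\right) ^{2}\ge 0\); it is applied here with \(a=\eta \), \(x\) the averaged gradient-noise term, and \(y=\bar{\theta }^{*(1)}(t,e+1)-\bar{\theta }^{*(1)}(t,e)\).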
Combining Eqs. A4 and A5 with \(\eta \le \frac{1}{4M}\) yields
$$\begin{aligned}&\mathbb {E}\left[ F\left( \bar{\theta }^{*(1)}(t, e+1)\right) -F\left( \theta _{best}^{*(1)}\right) \mid \mathcal {F}^{(t, e)}\right] \nonumber \\&\quad \le \frac{\eta \delta ^{2}}{L}-\left( \frac{1}{4 \eta }-M\right) \mathbb {E}\left[ \left\| \bar{\theta }^{*(1)}(t, e+1)-\bar{\theta }^{*(1)}(t, e)\right\| ^{2} \mid \mathcal {F}^{(t, e)}\right] \nonumber \\&\qquad +\frac{M}{L} \sum _{j=1}^{L}\left\| \theta _{j}^{*(1)}(t, e)-\bar{\theta }^{*(1)}(t, e)\right\| ^{2}\nonumber \\&\qquad -\frac{1}{2 \eta }\left( \mathbb {E}\left[ \left\| \bar{\theta }^{*(1)}(t, e+1)-\theta _{best}^{*(1)}\right\| ^{2} \mid \mathcal {F}^{(t, e)}\right] -\left\| \bar{\theta }^{*(1)}(t, e)-\theta _{best}^{*(1)}\right\| ^{2}\right) \nonumber \\&\quad \le \frac{\eta \delta ^{2}}{L}+\frac{M}{L} \sum _{j=1}^{L}\left\| \theta _{j}^{*(1)}(t, e)-\bar{\theta }^{*(1)}(t, e)\right\| ^{2}\nonumber \\&\qquad -\frac{1}{2 \eta }\left( \mathbb {E}\left[ \left\| \bar{\theta }^{*(1)}(t, e+1)-\theta _{best}^{*(1)}\right\| ^{2} \mid \mathcal {F}^{(t, e)}\right] -\left\| \bar{\theta }^{*(1)}(t, e)-\theta _{best}^{*(1)}\right\| ^{2}\right) . \end{aligned}$$
(A6)
Eq. A6 holds for a single local training step. Extending it over the entire local training process yields
$$\begin{aligned}&\mathbb {E}\left[ \frac{1}{E} \sum _{e=1}^{E}\left( F\left( \bar{\theta }^{*(1)}(t, e)\right) -F\left( \theta _{best}^{*(1)}\right) \right) \mid \mathcal {F}^{(t, 0)}\right] \nonumber \\&\quad \le \frac{\eta \delta ^{2}}{L}+\frac{1}{2 \eta E}\left( \left\| \bar{\theta }^{*(1)}(t, 0)-\theta _{best}^{*(1)}\right\| ^{2}-\mathbb {E}\left[ \left\| \bar{\theta }^{*(1)}(t, E)-\theta _{best}^{*(1)}\right\| ^{2} \mid \mathcal {F}^{(t, 0)}\right] \right) \nonumber \\&\qquad +\frac{M}{L E} \sum _{j=1}^{L} \sum _{e=1}^{E} \mathbb {E}\left[ \left\| \theta _{j}^{*(1)}(t, e)-\bar{\theta }^{*(1)}(t, e)\right\| ^{2} \mid \mathcal {F}^{(t, 0)}\right] . \end{aligned}$$
(A7)
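The extension from Eq. A6 to Eq. A7 is a standard telescoping argument, sketched here for completeness: taking the total expectation of Eq. A6 and summing over the local epochs \(e=0,\ldots ,E-1\), the intermediate distance terms cancel,
$$\begin{aligned} \sum _{e=0}^{E-1}\left( \mathbb {E}\left[ \left\| \bar{\theta }^{*(1)}(t, e+1)-\theta _{best}^{*(1)}\right\| ^{2}\right] -\mathbb {E}\left[ \left\| \bar{\theta }^{*(1)}(t, e)-\theta _{best}^{*(1)}\right\| ^{2}\right] \right) =\mathbb {E}\left[ \left\| \bar{\theta }^{*(1)}(t, E)-\theta _{best}^{*(1)}\right\| ^{2}\right] -\left\| \bar{\theta }^{*(1)}(t, 0)-\theta _{best}^{*(1)}\right\| ^{2}, \end{aligned}$$
so only the first and last iterates survive; dividing by \(E\) gives the averaged bound of Eq. A7.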

Proof of Lemma 2

For any two clients, the following bound holds; take client 1 and client 2 as an example:
$$\begin{aligned}&\mathbb {E}\left[ \left\| \theta _1^{*(1)}(t,e+1) - \theta _2^{*(1)}(t,e+1)\right\| ^2 \mid \mathcal {F}^{(t,e)}\right] \nonumber \\&\quad = \mathbb {E}\left[ \left\| \theta _1^{*(1)}(t,e) - \theta _2^{*(1)}(t,e) - \eta \left( G_1\left( \theta _1^{*(1)}(t,e)\right) - G_2\left( \theta _2^{*(1)}(t,e)\right) \right) \right\| ^2 \mid \mathcal {F}^{(t,e)}\right] \nonumber \\&\quad = \mathbb {E}\left[ \left\| \theta _1^{*(1)}(t,e) - \theta _2^{*(1)}(t,e) - \eta \left( \nabla F_1\left( \theta _1^{*(1)}(t,e)\right) - \nabla F_2\left( \theta _2^{*(1)}(t,e)\right) \right) - \eta \left( G_1\left( \theta _1^{*(1)}(t,e)\right) - \nabla F_1\left( \theta _1^{*(1)}(t,e)\right) \right) - \eta \left( \nabla F_2\left( \theta _2^{*(1)}(t,e)\right) - G_2\left( \theta _2^{*(1)}(t,e)\right) \right) \right\| ^2 \mid \mathcal {F}^{(t,e)}\right] \nonumber \\&\quad \le \left\| \theta _1^{*(1)}(t,e) - \theta _2^{*(1)}(t,e)\right\| ^2 - 2\eta \left\langle \nabla F_1\left( \theta _1^{*(1)}(t,e)\right) - \nabla F_2\left( \theta _2^{*(1)}(t,e)\right) , \theta _1^{*(1)}(t,e) - \theta _2^{*(1)}(t,e)\right\rangle \nonumber \\&\qquad + \eta ^2\left\| \nabla F_1\left( \theta _1^{*(1)}(t,e)\right) - \nabla F_2\left( \theta _2^{*(1)}(t,e)\right) \right\| ^2 + 2\eta ^2\delta ^2, \end{aligned}$$
(B8)
where the term \(- \left\langle \nabla {F_1}\left( {\theta _1^{*(1)}(t,e)} \right) - \nabla {F_2}\left( {\theta _2^{*(1)}(t,e)} \right) , \right. \left. \theta _1^{*(1)}(t,e) - \theta _2^{*(1)}(t,e) \right\rangle \) is bounded as
$$\begin{aligned}&-\left\langle \nabla F_{1}\left( \theta _{1}^{*(1)}(t, e)\right) -\nabla F_{2}\left( \theta _{2}^{*(1)}(t, e)\right) , \theta _{1}^{*(1)}(t, e)-\theta _{2}^{*(1)}(t, e)\right\rangle \nonumber \\&\quad \le -\left\langle \nabla F\left( \theta _{1}^{*(1)}(t, e)\right) -\nabla F\left( \theta _{2}^{*(1)}(t, e)\right) , \theta _{1}^{*(1)}(t, e)-\theta _{2}^{*(1)}(t, e)\right\rangle +2 \beta \left\| \theta _{1}^{*(1)}(t, e)-\theta _{2}^{*(1)}(t, e)\right\| \nonumber \\&\quad \le -\frac{1}{M}\left\| \nabla F\left( \theta _{1}^{*(1)}(t, e)\right) -\nabla F\left( \theta _{2}^{*(1)}(t, e)\right) \right\| ^{2}+2 \beta \left\| \theta _{1}^{*(1)}(t, e)-\theta _{2}^{*(1)}(t, e)\right\| \quad \text {(Assumption 5, property (ii))} \nonumber \\&\quad \le -\frac{1}{M}\left\| \nabla F\left( \theta _{1}^{*(1)}(t, e)\right) -\nabla F\left( \theta _{2}^{*(1)}(t, e)\right) \right\| ^{2}+\frac{1}{2 \eta E}\left\| \theta _{1}^{*(1)}(t, e)-\theta _{2}^{*(1)}(t, e)\right\| ^{2}+2 \eta E \beta ^{2} \quad \text {(Mean inequality)}, \end{aligned}$$
(B9)
the term \({\left\| {\nabla {F_1}\left( {\theta _1^{*(1)}(t,e)} \right) - \nabla {F_2}\left( {\theta _2^{*(1)}(t,e)} \right) } \right\| ^2}\) is bounded as
$$\begin{aligned}&\left\| \nabla F_{1}\left( \theta _{1}^{*(1)}(t, e)\right) -\nabla F_{2}\left( \theta _{2}^{*(1)}(t, e)\right) \right\| ^{2}\nonumber \\&\quad =\Vert \nabla F\left( \theta _{1}^{*(1)}(t, e)\right) -\nabla F\left( \theta _{2}^{*(1)}(t, e)\right) +\nabla F_{1}\left( \theta _{1}^{*(1)}(t, e)\right) -\nabla F\left( \theta _{1}^{*(1)}(t, e)\right) +\nabla F\left( \theta _{2}^{*(1)}(t, e)\right) -\nabla F_{2}\left( \theta _{2}^{*(1)}(t, e)\right) \Vert ^{2}\nonumber \\&\quad \le 3\left\| \nabla F\left( \theta _{1}^{*(1)}(t, e)\right) -\nabla F\left( \theta _{2}^{*(1)}(t, e)\right) \right\| ^{2}+3\left\| \nabla F_{1}\left( \theta _{1}^{*(1)}(t, e)\right) -\nabla F\left( \theta _{1}^{*(1)}(t, e)\right) \right\| ^{2}+3\left\| \nabla F\left( \theta _{2}^{*(1)}(t, e)\right) -\nabla F_{2}\left( \theta _{2}^{*(1)}(t, e)\right) \right\| ^{2}\nonumber \\&\quad \le 3\left\| \nabla F\left( \theta _{1}^{*(1)}(t, e)\right) -\nabla F\left( \theta _{2}^{*(1)}(t, e)\right) \right\| ^{2}+6 \beta ^{2}. \end{aligned}$$
(B10)
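The factor of 3 in the final step of Eq. B10 comes from the elementary inequality, valid for any vectors \(a, b, c\):
$$\begin{aligned} \Vert a+b+c\Vert ^{2} \le 3\left( \Vert a\Vert ^{2}+\Vert b\Vert ^{2}+\Vert c\Vert ^{2}\right) , \end{aligned}$$
a consequence of the Cauchy–Schwarz inequality; the two gradient-divergence terms are then each bounded by \(\beta ^{2}\), giving the additive \(6 \beta ^{2}\).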
By combining Eqs. B8, B9, and B10, we get the following conclusion:
$$\begin{aligned}&\mathbb {E}\left[ \left\| \theta _{1}^{*(1)}(t, e+1)-\theta _{2}^{*(1)}(t, e+1)\right\| ^{2} \mid \mathcal {F}^{(t, e)}\right] \nonumber \\&\quad \le \left\| \theta _{1}^{*(1)}(t, e)-\theta _{2}^{*(1)}(t, e)\right\| ^{2}+2 \eta \left[ -\frac{1}{M}\left\| \nabla F\left( \theta _{1}^{*(1)}(t, e)\right) -\nabla F\left( \theta _{2}^{*(1)}(t, e)\right) \right\| ^{2}+\frac{1}{2 \eta E}\left\| \theta _{1}^{*(1)}(t, e)-\theta _{2}^{*(1)}(t, e)\right\| ^{2}+2 \eta E \beta ^{2}\right] \nonumber \\&\qquad +\eta ^{2}\left( 3\left\| \nabla F\left( \theta _{1}^{*(1)}(t, e)\right) -\nabla F\left( \theta _{2}^{*(1)}(t, e)\right) \right\| ^{2}+6 \beta ^{2}\right) +2 \eta ^{2} \delta ^{2} \nonumber \\&\quad =\left( 1+\frac{1}{E}\right) \left\| \theta _{1}^{*(1)}(t, e)-\theta _{2}^{*(1)}(t, e)\right\| ^{2}+\left( 3 \eta ^{2}-\frac{2 \eta }{M}\right) \left\| \nabla F\left( \theta _{1}^{*(1)}(t, e)\right) -\nabla F\left( \theta _{2}^{*(1)}(t, e)\right) \right\| ^{2}+4 \eta ^{2} E \beta ^{2}+6 \eta ^{2} \beta ^{2}+2 \eta ^{2} \delta ^{2} \nonumber \\&\quad \le \left( 1+\frac{1}{E}\right) \left\| \theta _{1}^{*(1)}(t, e)-\theta _{2}^{*(1)}(t, e)\right\| ^{2}+4 E \eta ^{2} \beta ^{2}+6 \eta ^{2} \beta ^{2}+2 \eta ^{2} \delta ^{2} \quad \left( \eta <\frac{1}{4 M}\right) \nonumber \\&\quad \le \left( 1+\frac{1}{E}\right) \left\| \theta _{1}^{*(1)}(t, e)-\theta _{2}^{*(1)}(t, e)\right\| ^{2}+10 E \eta ^{2} \beta ^{2}+2 \eta ^{2} \delta ^{2} \quad (E \ge 1). \end{aligned}$$
(B11)
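Although not stated explicitly in the appendix, unrolling the one-step recursion in Eq. B11 bounds the total client drift within a round. Writing \(x_{e}=\mathbb {E}\left[ \Vert \theta _{1}^{*(1)}(t, e)-\theta _{2}^{*(1)}(t, e)\Vert ^{2}\right] \) and \(c=10 E \eta ^{2} \beta ^{2}+2 \eta ^{2} \delta ^{2}\), and assuming \(x_{0}=0\) (both clients start the round from the same global model), for \(e\le E\):
$$\begin{aligned} x_{e} \le c \sum _{i=0}^{e-1}\left( 1+\frac{1}{E}\right) ^{i}=c E\left( \left( 1+\frac{1}{E}\right) ^{e}-1\right) \le (\mathrm {e}-1) E c, \end{aligned}$$
using \(\left( 1+\frac{1}{E}\right) ^{E}\le \mathrm {e}\); the divergence between any two clients therefore stays \(O\left( E^{2} \eta ^{2} \beta ^{2}+E \eta ^{2} \delta ^{2}\right) \) throughout local training.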
Metadata
Title
Semi-HFL: semi-supervised federated learning for heterogeneous devices
Authors
Zhengyi Zhong
Ji Wang
Weidong Bao
Jingxuan Zhou
Xiaomin Zhu
Xiongtao Zhang
Publication date
04.11.2022
Publisher
Springer International Publishing
Published in
Complex & Intelligent Systems / Issue 2/2023
Print ISSN: 2199-4536
Electronic ISSN: 2198-6053
DOI
https://doi.org/10.1007/s40747-022-00894-4
