Introduction
- We propose FedFreezeBERT, a novel framework that combines BERT-based text classification with Federated Learning. To the best of our knowledge, FedFreezeBERT is the most cost-effective approach for integrating BERT into a Federated Learning environment.
- Achieving new state-of-the-art performance on Arabic sentiment classification, surpassing FedSplitBERT by 1.2%.
- Reducing communication costs by a factor of 5\(\times\) compared to the previous SOTA.
- Improving FedSplitBERT's performance on Arabic sentiment classification by 0.66% through the use of aggregation architectures.
Related work
Federated learning
BERT in federated learning
Methodology
FedFreezeBERT
Distributed-FedFreezeBERT
Centralized-FedFreezeBERT
BERT aggregation architectures
- OrdinaryAggregator: This is the standard architecture, in which BERT's [CLS] output embedding is used as the sentence embedding. The [CLS] embedding is then fed to the classifier, which in our case is a single linear layer.
- AverageAggregator: An intuitive and very simple architecture in which all of BERT's final-layer contextual embeddings are averaged to obtain a fixed-size embedding representing the input sentence. Ref. [24] shows that, despite its simplicity, this architecture can achieve high performance when BERT's parameters are kept frozen, with results very close to using the [CLS] embedding with a fine-tuned BERT. We include it in our experiments because it is simple yet effective when BERT is frozen.
- P-SUM: This architecture was first proposed by [13] to improve BERT's performance on aspect-based sentiment analysis; Ref. [24] later showed that its performance can be further improved when BERT's parameters are kept frozen. The architectural details are shown in Fig. 3. Four extra BERT layers are added on top of BERT in parallel, with each of BERT's last four layers passing its output to one of the extra layers. Each of the four parallel paths ends in a classifier; during training the four classifiers' losses are summed, and at inference their outputs are averaged to obtain the final predictions. As shown by [13] and [24], four is the best number of last layers to use, so we follow this choice in our experiments.
- H-SUM: This architecture was also first proposed by [13] for aspect-based sentiment analysis, and Ref. [24] likewise showed that it benefits from keeping BERT's parameters frozen. The architectural details are shown in Fig. 4. Four extra BERT layers are added on top of BERT in a hierarchical fashion, such that each extra layer's output is added to the input of the extra layer that follows it. Each of the four paths again ends in a classifier; during training the four losses are summed, and at inference the four outputs are averaged to obtain the final predictions. Following [13] and [24], we again use the last four layers. A minimal sketch of these four aggregation heads is given after this list.
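For concreteness, below is a minimal PyTorch sketch of the four aggregation heads on top of a frozen BERT encoder. It assumes the HuggingFace `transformers` API; the class names (`OrdinaryAggregator`, `AverageAggregator`, `PSumHead`, `HSumHead`), the placeholder checkpoint, and the initialization of the extra layers are illustrative choices, not taken from the paper's released code.

```python
# Minimal sketch of the four aggregation heads over a frozen BERT.
# Assumes HuggingFace `transformers`; names and checkpoint are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel
from transformers.models.bert.modeling_bert import BertLayer


def extended_mask(attention_mask):
    # (batch, seq) 0/1 mask -> additive (batch, 1, 1, seq) mask for BertLayer
    return (1.0 - attention_mask[:, None, None, :].float()) * torch.finfo(torch.float32).min


class OrdinaryAggregator(nn.Module):
    """[CLS] embedding of the last layer -> single linear classifier."""
    def __init__(self, hidden_size, n_classes):
        super().__init__()
        self.cls = nn.Linear(hidden_size, n_classes)

    def forward(self, hidden_states, attention_mask):
        return [self.cls(hidden_states[-1][:, 0])]


class AverageAggregator(nn.Module):
    """Masked mean of all last-layer token embeddings -> linear classifier."""
    def __init__(self, hidden_size, n_classes):
        super().__init__()
        self.cls = nn.Linear(hidden_size, n_classes)

    def forward(self, hidden_states, attention_mask):
        m = attention_mask.unsqueeze(-1).float()
        mean = (hidden_states[-1] * m).sum(1) / m.sum(1).clamp(min=1e-9)
        return [self.cls(mean)]


class PSumHead(nn.Module):
    """Four extra BERT layers in parallel, one per last encoder layer;
    each path ends in its own classifier. The extra layers are freshly
    initialized here; the papers may initialize them from BERT's own layers."""
    def __init__(self, config, n_classes, k=4):
        super().__init__()
        self.k = k
        self.layers = nn.ModuleList(BertLayer(config) for _ in range(k))
        self.heads = nn.ModuleList(nn.Linear(config.hidden_size, n_classes)
                                   for _ in range(k))

    def forward(self, hidden_states, attention_mask):
        ext = extended_mask(attention_mask)
        return [self.heads[i](self.layers[i](hidden_states[-self.k + i],
                                             attention_mask=ext)[0][:, 0])
                for i in range(self.k)]


class HSumHead(PSumHead):
    """Hierarchical variant: each extra layer's output is added to the
    input of the next extra layer."""
    def forward(self, hidden_states, attention_mask):
        ext = extended_mask(attention_mask)
        logits, prev = [], None
        for i in range(self.k):
            inp = hidden_states[-self.k + i]
            if prev is not None:
                inp = inp + prev
            prev = self.layers[i](inp, attention_mask=ext)[0]
            logits.append(self.heads[i](prev[:, 0]))
        return logits


# Usage: BERT stays frozen; only the head is trained (and communicated).
bert = AutoModel.from_pretrained("UBC-NLP/MARBERT")  # placeholder checkpoint
for p in bert.parameters():
    p.requires_grad = False

head = PSumHead(bert.config, n_classes=3)
input_ids = torch.randint(0, bert.config.vocab_size, (2, 128))
attention_mask = torch.ones(2, 128, dtype=torch.long)
labels = torch.tensor([0, 2])

with torch.no_grad():
    out = bert(input_ids, attention_mask=attention_mask, output_hidden_states=True)
logits_list = head(out.hidden_states, attention_mask)
loss = sum(F.cross_entropy(l, labels) for l in logits_list)   # sum losses (training)
preds = torch.stack(logits_list).mean(0).argmax(-1)           # average (inference)
```

Each head returns a list of logits so that training (summing the losses) and inference (averaging the logits) are handled uniformly across all four architectures.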
Modified FedSplitBERT
Experiments
Dataset and evaluation metric
Task | Class | Count |
---|---|---|
Sentiment | Positive | 2180 |
 | Negative | 4621 |
 | Neutral | 5747 |
 | Total | 12,548 |
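For reference, \(F_1^{PN}\) is conventionally the macro-averaged F1 over the positive and negative classes only, excluding the neutral class from the average. Below is a minimal sketch of its computation, assuming this convention and scikit-learn; the label strings are placeholders.

```python
# Sketch of F1^PN, assuming the usual convention: the unweighted mean of the
# positive-class and negative-class F1 scores, with neutral excluded.
from sklearn.metrics import f1_score

y_true = ["pos", "neg", "neu", "pos", "neg"]  # placeholder labels
y_pred = ["pos", "neg", "pos", "pos", "neu"]

# `labels=` restricts the macro average to the positive and negative classes.
f1_pn = f1_score(y_true, y_pred, labels=["pos", "neg"], average="macro")
print(f"F1^PN = {f1_pn:.4f}")
```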
Pre-trained language model
Baseline methods
FedSplitBERT | Accuracy (%) |
---|---|
Original [20] (from paper) | 93.27 |
Our implementation | 93.29 |
Experimental setup
Implementation
Hyperparameters
Results and analysis
Baseline Methods and FedFreezeBERT with OrdinaryAggregator

Method | BERT Frozen? | \(F_1^{PN}\) |
---|---|---|
Central Training | No | 73.48 |
FedAvg | No | 66.67 |
FedProx | No | 66.36 |
FedSplitBERT | No | 74.39 |
D-FedFreezeBERT (FedAvg) | Yes | 49.39 |
D-FedFreezeBERT (FedProx) | Yes | 47.50 |
C-FedFreezeBERT | Yes | 65.10 |
\(F_1^{PN}\) Using Advanced Aggregation Architectures

Method | BERT Frozen? | AverageAggregator | P-SUM | H-SUM |
---|---|---|---|---|
Central Training | No | 73.00 | 74.63 | 73.93 |
FedAvg | No | 63.84 | 71.71 | 68.81 |
FedProx | No | 63.55 | 71.56 | 68.54 |
FedSplitBERT | No | 74.08 | 74.57 | 74.88 |
D-FedFreezeBERT (FedAvg) | Yes | 54.23 | 74.18 | 74.29 |
D-FedFreezeBERT (FedProx) | Yes | 52.09 | 74.13 | 74.21 |
C-FedFreezeBERT | Yes | 72.34 | 75.26 | 74.94 |
- \(n_c = 5\)
- \(T = 5\)
- \(n_s = 12548\)
- \(n_t = 128\)
- \(c = 8\)
- \(SizeOf(W_s) = 113.44\) MB
- \(dim_{emb} = 768\)
- \(SizeOf(BL) = 28.35\) MB
- \(SizeOf(BE) = 308\) MB
- \(SizeOf(BERT) = 651.37\) MB
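To make the cost figures easy to verify, the sketch below recomputes the table from these parameters. The per-method formulas are our reading of the protocols (each communication round costs one upload plus one download of whatever weights are exchanged); they are assumptions that reproduce the reported numbers, not formulas quoted verbatim from the paper.

```python
# Sketch reproducing the communication-cost arithmetic; formulas are our
# reading of the protocols, chosen to match the reported table.
MB = 1.0
GB = 1024 * MB

n_c, T = 5, 5            # clients, communication rounds
n_s, n_t = 12_548, 128   # total samples, tokens per sample
c = 8                    # FedSplitBERT: layers below the critical layer
dim_emb = 768            # embedding dimension
W_s = 113.44 * MB        # trainable aggregation head (~4 BERT layers)
BL = 28.35 * MB          # one BERT encoder layer
BE = 308.0 * MB          # BERT embedding matrix
BERT = 651.37 * MB       # full BERT model

# FedAvg/FedProx: every client exchanges the full model every round.
fedavg = n_c * T * 2 * BERT
# FedSplitBERT: only the embeddings and the c layers below the critical
# layer are exchanged; the upper layers stay local.
fedsplit = n_c * T * 2 * (BE + c * BL)
# D-FedFreezeBERT: BERT is frozen, so only the head W_s moves each round.
d_fedfreeze = n_c * T * 2 * W_s
# C-FedFreezeBERT: each client uploads its frozen-BERT token embeddings
# once (float32), and the trained head is downloaded once per client.
embeddings = n_s * n_t * dim_emb * 4 / 2**20 * MB  # bytes -> MB
c_fedfreeze = embeddings + n_c * W_s

for name, cost in [("FedAvg/FedProx", fedavg), ("FedSplitBERT", fedsplit),
                   ("D-FedFreezeBERT", d_fedfreeze), ("C-FedFreezeBERT", c_fedfreeze)]:
    print(f"{name:16s} {cost / GB:6.2f} GB  gain {fedavg / cost:4.1f}x")
# -> 31.80, 26.11, 5.54, 5.15 GB; gains 1.0x, 1.2x, 5.7x, 6.2x
```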
Method | Communication cost (GB) | Communication gain |
---|---|---|
FedAvg/FedProx | 31.8 | 1\(\times\) |
FedSplitBERT | 26.11 | 1.2\(\times\) |
D-FedFreezeBERT (P-SUM/H-SUM) | 5.54 | 5.7\(\times\) |
C-FedFreezeBERT | 5.15 | 6.2\(\times\) |