1 Introduction
Aspect | This review | Liu et al. [88] | Yang et al. [148] |
---|---|---|---|
Methodology | Structured Literature Review (SLR), ensuring transparency and replicability | General overview, no clear replicable methodology | General overview, no clear replicable methodology |
Organisation | Lifecycle Approach covering VFL from foundational concepts to real-world applications | Taxonomy categorizing VFL into key areas like communication efficiency, privacy, and data evaluation | Layered approach, dividing VFL components into layers: hardware, privacy-preserving primitives, secure algorithms, VFL systems, and applications |
Coverage of VFL components | Covers all components of VFL (algorithms, communication, learning, privacy, valuation, and applications) uniformly, and discusses the most recent works thoroughly | Covers most components but goes into more detail on areas such as privacy, with less coverage of valuation and incentive mechanisms, which are the management aspects of VFL | Discusses areas such as hardware and VFL algorithms in detail, but does not mention learning challenges such as model fairness, limited training data, and feature selection |
Open challenges & future directions | Model Drift, Fairness, Incentive Mechanism, Explainability, Dataset Availability | Interoperability, Trustworthy VFL, Automated and Blockchained VFL | Explains current developments, lacks clear future directions |
2 Methodology
2.1 Research questions
- RQ1: What methods are currently employed in VFL, and how do they address its challenges?
- RQ2: What are the current applications of VFL?
- RQ3: What are the potential future directions for research in VFL?
Search terms | Google Scholar | Web of Science | IEEE Xplore | arXiv | No. of articles | Unique articles |
---|---|---|---|---|---|---|
“Vertical federated learning” | 158 | 53 | 66 | 61 | 338 | 226 |
“Vertical” AND “Federated Learning” | 113 | 47 | 96 | 123 | 379 | 234 |
“Vertical” AND “privacy-preserving federated learning” | 24 | 10 | 9 | 8 | 51 | 29 |
“Vertical” AND “Heterogeneous Federated Learning” | 7 | 0 | 0 | 2 | 9 | 9 |
Total no. of unique articles | | | | | | 271 |
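As a quick consistency check on the search-results table above, the following sketch recomputes each row total from the per-database counts. The counts are copied from the table; the variable and function names are illustrative only and are not part of the review's method.

```python
# Per-database hit counts copied from the search-results table, in the order
# (Google Scholar, Web of Science, IEEE Xplore, arXiv), plus the reported total.
search_hits = {
    '"Vertical federated learning"': ((158, 53, 66, 61), 338),
    '"Vertical" AND "Federated Learning"': ((113, 47, 96, 123), 379),
    '"Vertical" AND "privacy-preserving federated learning"': ((24, 10, 9, 8), 51),
    '"Vertical" AND "Heterogeneous Federated Learning"': ((7, 0, 0, 2), 9),
}

# Unique-article counts per search term and the overall figure, as reported.
reported_unique_per_term = [226, 234, 29, 9]
overall_unique = 271


def row_totals_consistent(hits):
    """Return, per search term, whether the database counts sum to the total."""
    return {term: sum(counts) == total for term, (counts, total) in hits.items()}


checks = row_totals_consistent(search_hits)
for term, ok in checks.items():
    print(term, "consistent" if ok else "MISMATCH")

# The per-term unique counts add up to more than the overall unique count,
# which is expected when the same article is returned by several search terms.
print(sum(reported_unique_per_term), ">=", overall_unique)
```

The per-term unique counts sum to more than 271 because the same article can be retrieved by more than one search term, so the final figure reflects de-duplication across terms as well as across databases.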
2.2 Search strategy
2.3 Study selection
- Published between 2016 and 2023
- Written in English
- Full text available
- Title and abstract specifically mention a focus on vertical federated learning
- Is the article relevant to VFL and not just general FL?
- Does the article provide an answer to any of the research questions?
- Does the article demonstrate a strong methodological approach, either empirical or theoretical?
- Are the experiment setup and results properly documented?
2.4 Data extraction
2.5 Data synthesis
3 Results
4 VFL foundations
4.1 VFL architecture
Criteria | Server-client VFL | Decentralized VFL |
---|---|---|
Architecture | Central server | No central server |
Communication | Through central server | Directly between parties |
Single point of failure | Yes | No |
Synchronization | Easier | More challenging |
Security | Depends on server | Depends on protocol |
Scalability | Can be limited | More scalable |
Implementation | Easier | More challenging |
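To make the "Communication" row of the comparison above concrete, here is a minimal sketch of one server-client forward pass for a vertically partitioned logistic regression. It assumes two parties that hold disjoint feature columns of the same sample; all names and numbers are hypothetical, and no privacy protection (HE, SMPC, DP) is applied, which a real deployment would need.

```python
import math


def partial_score(weights, features):
    """Computed locally by one party: dot product over its own feature slice."""
    return sum(w * x for w, x in zip(weights, features))


def server_aggregate(partials):
    """Computed by the central server: sum the partial scores, apply a sigmoid."""
    z = sum(partials)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical vertical split of ONE sample: party A holds two features,
# party B holds one. Raw features never leave the party that owns them.
party_a = {"weights": [0.5, -0.2], "features": [1.0, 2.0]}
party_b = {"weights": [0.1], "features": [3.0]}

# Each party sends only its scalar partial score to the server.
partials = [partial_score(p["weights"], p["features"]) for p in (party_a, party_b)]
prediction = server_aggregate(partials)
print(partials, prediction)
```

In a decentralized variant, the parties would exchange the partial scores directly (or via a peer-to-peer protocol) instead of routing them through `server_aggregate`, which removes the single point of failure at the cost of harder synchronization.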
4.2 VFL protocol
4.2.1 VFL training
4.2.2 VFL inference
4.3 VFL algorithms
5 VFL development
5.1 Communication
5.1.1 Communication efficiency
Category | Article | Method | Model | Dataset |
---|---|---|---|---|
Modification in local updates | [87] | Stochastic block coordinate descent with multiple updates of local models | Logistic Regression, Neural Network | MIMIC-III, NUS-WIDE, MNIST, Default-Credit |
Modification in local updates | | Quasi-Newton method | Logistic Regression | Default-Credit |
Modification in local updates | [145] | Eliminates the need for peer-to-peer communication among clients by using functional encryption schemes | Linear Regression, Logistic Regression, Linear SVM | Website phishing, Ionosphere, Landsat satellite, Optical recognition of handwritten digits, MNIST |
Modification in local updates | [141] | Allows multiple local updates in each round by using the alternating direction method of multipliers (ADMM) | Convolutional Neural Network | MNIST, CIFAR-10, NUS-WIDE, ModelNet40 |
Modification in local updates | [39] | Cache-enabled local updates at each client | Neural Network | Criteo, Avazu |
Modification in local updates | [157] | Adaptive selection of local updates | Logistic Regression, Neural Network | a9a, MNIST, Citeseer |
Compression | [18] | Arbitrary compression scheme on gradients of local models | Neural Network | MIMIC-III, CIFAR-10, ModelNet40 |
Compression | [147] | Transmission of selective gradients after compression | Logistic Regression | Default-Credit |
Compression | [75] | Double-end sparse compression on local models | Logistic Regression, Neural Network | Default-Credit, Insurance claim dataset |
Compression | [63] | Compression on local data using autoencoders | Logistic Regression, SVM | Adult income, Wine-quality, Breast cancer, Rice MSC |
Compression | [106] | Compression on local data using autoencoders | Logistic Regression | Bank loan dataset |
Compression | [19] | Compression on local data using autoencoders | Neural Network | Adult income, Vestibular Schwannoma Dataset, The eICU Collaborative Research Database |
Compression | [111] | Compression on local image data using feature maps | Neural Network | CIFAR-10, CIFAR-100, CINIC-10 |
Compression | [139] | Compression on local data using unsupervised representation learning | Neural Network | NUS-WIDE, MNIST |
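The gradient-compression entries above share one idea: transmit fewer numbers per round. As a generic illustration only, the sketch below shows top-k magnitude sparsification. This is a standard technique chosen for the example and is not claimed to be the scheme used in [18], [147], or [75]; the function names are hypothetical.

```python
def top_k_sparsify(gradient, k):
    """Keep the k largest-magnitude entries; return (indices, values).

    Only these pairs would be transmitted instead of the full vector.
    """
    if k <= 0:
        return [], []
    order = sorted(range(len(gradient)), key=lambda i: abs(gradient[i]), reverse=True)
    kept = sorted(order[:k])  # restore positional order for the receiver
    return kept, [gradient[i] for i in kept]


def densify(indices, values, length):
    """Receiver side: rebuild a dense vector, filling dropped entries with 0."""
    dense = [0.0] * length
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


gradient = [0.02, -0.9, 0.1, 0.75, -0.03]
indices, values = top_k_sparsify(gradient, k=2)
restored = densify(indices, values, len(gradient))
print(indices, values, restored)
```

With `k=2` the sender transmits two index-value pairs rather than five floats; the dropped entries are treated as zero by the receiver, which is the source of the approximation error that compression methods must control.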
5.1.2 Asynchronism
5.2 Learning
5.2.1 Feature selection
Feature selection method | Architecture | FS during training | Dependency on labels | Non-overlap utilization | Privacy protocol |
---|---|---|---|---|---|
 | Server-Client, Decentralized | No | Independent | No | No raw data shared |
Federated LASSO Regularization [17] | Server-Client | Yes | Dependent | No | No raw data shared |
VFLFS [35] | Decentralized | Yes | Partially Dependent | Yes | No raw data shared |
FedSDG-FS [73] | Server-Client | Yes | Dependent | No | Partial HE |
MMVFL [36] | Server-Client | Yes | Partially Dependent | No | No raw data shared |
Gini-impurity FS [160] | Server-Client | No | Dependent | No | Secret Sharing, HE |
PSO-EVFFS [164] | Decentralized | Yes | Dependent | No | HE |
5.2.2 Limited training samples
5.2.3 Model fairness
5.3 Privacy and security
5.3.1 Privacy-preserving protocols
5.3.2 Privacy attacks in VFL
5.3.3 Defense mechanisms
6 Evaluation and management
6.1 Valuation
6.2 Explainability
7 VFL deployment
7.1 Frameworks
Framework | Server-client architecture | Decentralized architecture | Asynchronous update | Models supported | HE | SMPC | DP |
---|---|---|---|---|---|---|---|
PySyft | ✓ | × | × | Regression, NN | × | × | ✓ |
FATE | ✓ | × | × | Regression, Tree-based | ✓ | ✓ | × |
PaddleFL | ✓ | × | × | Regression, NN, SVM | × | ✓ | ✓ |
FedML | ✓ | ✓ | × | Regression, NN | × | ✓ | ✓ |
NVFLARE | ✓ | × | ✓ | NN, Tree-based | × | × | ✓ |
FederatedScope | ✓ | ✓ | ✓ | Regression, NN, GNN | ✓ | ✓ | ✓ |
Flower | ✓ | × | × | General ML Models, NN | × | × | ✓ |
Crypten | ✓ | × | × | General ML Models, NN | × | × | ✓ |
FedTree | ✓ | × | × | Tree-based (GBDT) | × | × | ✓ |
7.2 Applications
8 Open challenges and future directions
8.1 Model drift
8.2 Fairness
8.3 Explainability
8.4 Incentive mechanism
8.5 Dataset availability
Category | Dataset | Data type | No. of features | No. of instances |
---|---|---|---|---|
Financial | Income [9] | Tabular | 14 | 48,842 |
Financial | Bank [95] | Tabular | 16 | 45,211 |
Financial | Credit Card [152] | Tabular | 24 | 30,000 |
Medical | MIMIC-III [56] | Tabular | Varies by component | 42,276 |
Medical | Breast Cancer [134] | Tabular | 30 | 569 |
Medical | Diabetes [114] | Tabular | 9 | 769 |
Medical | BHI [94] | Image | N/A | 277,524 |
Medical | CheXpert [53] | Image | N/A | 65,240 |
Advertising & marketing | Avazu [58] | Tabular | 24 | 4M |
Advertising & marketing | Criteo [69] | Tabular | 40 | 4.5M |
Transportation & engineering | Vehicle [32] | Tabular | Not specified | 98,528 |
Transportation & engineering | Drive [154] | Tabular | Not specified | 58,509 |
Multimedia & web | NUS-WIDE [27] | Tabular | 84 | 269,648 |
Multimedia & web | MNIST [31] | Image | N/A | 70,000 |
Multimedia & web | ModelNet [138] | Image | N/A | 20,000 |
Multimedia & web | Yahoo Answers [161] | Text | N/A | 1.46M |
Multimedia & web | News20 [62] | Text | N/A | 19,928 |