Introduction
- For high-dimensional cancer datasets, we propose a priori knowledge- and stability-based feature selection method (PKSFS), which exploits features supported by prior knowledge while preserving the stability of the proposed ensemble model, BQAXR. PKSFS reduces the computational complexity of BQAXR and improves the accuracy of cancer survival prediction. In our numerical studies, PKSFS guides the subsequent model construction better than widely used feature selection methods.
- We develop a two-stage heterogeneous stacked ensemble learning model, BQAXR, to predict the survival status of gastric cancer and skin cancer patients. BQAXR remedies the shortcomings of four heterogeneous base learners in the first stage and, in the second stage, integrates them through a stacked generalization strategy with an advanced meta learner: a multi-layer perceptron trained with the rectified Adam (RAdam) optimizer. To the best of our knowledge, this is the first ensemble learning model for gastric cancer and skin cancer survival prediction, and the experimental results demonstrate the superiority of BQAXR over popular machine learning methods.
- Most studies on cancer survival prediction focus on a few cancer types, such as breast cancer [8] and colorectal cancer [9]. Gastric cancer, one of the three leading causes of cancer death, has been largely overlooked, as have some rarer types of cancer, such as skin cancer. To cover both common and uncommon cancers, we collect real gastric cancer and skin cancer datasets and verify the superiority of the proposed method for cancer survival prediction on both.
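The stacked generalization strategy named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the BQAXR implementation: the three base learners (`NearestCentroid`, `KNearest`, `Stump`) are simple stand-ins for the improved learners, the meta learner `LogisticMeta` is a plain logistic regression rather than an RAdam-trained multi-layer perceptron, the data are synthetic, and building the meta-features from out-of-fold predictions is a common stacking practice that this section does not confirm for BQAXR.

```python
import math
import random


class NearestCentroid:
    """Stand-in base learner 1: score by relative distance to class centroids."""
    def fit(self, X, y):
        self.c = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.c[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_proba(self, X):
        out = []
        for x in X:
            d0, d1 = math.dist(x, self.c[0]), math.dist(x, self.c[1])
            out.append(d0 / (d0 + d1) if d0 + d1 > 0 else 0.5)
        return out


class KNearest:
    """Stand-in base learner 2: fraction of positive labels among k neighbours."""
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict_proba(self, X):
        out = []
        for x in X:
            near = sorted(zip(self.X, self.y), key=lambda p: math.dist(x, p[0]))
            out.append(sum(t for _, t in near[: self.k]) / self.k)
        return out


class Stump:
    """Stand-in base learner 3: best single-feature threshold rule."""
    def fit(self, X, y):
        best = None
        for j in range(len(X[0])):
            for thr in sorted({x[j] for x in X}):
                for sign in (1, -1):
                    err = sum((sign * (x[j] - thr) > 0) != bool(t) for x, t in zip(X, y))
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        _, self.j, self.thr, self.sign = best
        return self

    def predict_proba(self, X):
        return [float(self.sign * (x[self.j] - self.thr) > 0) for x in X]


class LogisticMeta:
    """Meta learner: logistic regression fitted by stochastic gradient descent."""
    def fit(self, Z, y, lr=0.1, epochs=300):
        self.w, self.b = [0.0] * len(Z[0]), 0.0
        for _ in range(epochs):
            for z, t in zip(Z, y):
                g = self.proba(z) - t  # gradient of the log-loss w.r.t. the logit
                self.w = [w - lr * g * zi for w, zi in zip(self.w, z)]
                self.b -= lr * g
        return self

    def proba(self, z):
        s = self.b + sum(w * zi for w, zi in zip(self.w, z))
        return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, s))))  # clamp avoids overflow

    def predict(self, Z):
        return [int(self.proba(z) >= 0.5) for z in Z]


def fit_stack(X, y, makers, n_folds=5):
    n = len(X)
    Z = [[0.0] * len(makers) for _ in range(n)]  # out-of-fold meta-features
    for f in range(n_folds):
        held = list(range(f, n, n_folds))
        train = [i for i in range(n) if i % n_folds != f]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        for b, make in enumerate(makers):
            preds = make().fit(Xtr, ytr).predict_proba([X[i] for i in held])
            for i, p in zip(held, preds):
                Z[i][b] = p
    meta = LogisticMeta().fit(Z, y)                # stage 2 sees only held-out outputs
    bases = [make().fit(X, y) for make in makers]  # refit stage 1 on all training data
    return bases, meta


def predict_stack(bases, meta, X):
    Z = list(zip(*(b.predict_proba(X) for b in bases)))
    return meta.predict(Z)


# Synthetic two-class data (hypothetical, for illustration only).
random.seed(0)
data = [([random.gauss(c, 1.0), random.gauss(c, 1.0)], int(c > 0))
        for c in (0.0, 3.0) for _ in range(40)]
random.shuffle(data)
X, y = [d[0] for d in data], [d[1] for d in data]

bases, meta = fit_stack(X[:60], y[:60], [NearestCentroid, KNearest, Stump])
pred = predict_stack(bases, meta, X[60:])
accuracy = sum(p == t for p, t in zip(pred, y[60:])) / len(pred)
print(f"held-out accuracy: {accuracy:.2f}")
```

Training the meta learner only on held-out predictions keeps it from simply trusting whichever base learner memorised the training set best, which is the usual motivation for a two-stage design of this kind.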
Literature review
Survival prediction methods
Heterogeneous ensemble learning methods
Study | Problem | Ensemble method (heterogeneous / homogeneous) | Stacked ensemble | Improved learner | Learner number | Ensemble member | Performance | |
---|---|---|---|---|---|---|---|---|
Wang et al. [9] | Cancer prognosis | √ | × | × | 21 | DT | → RF, RSB, GB, ADB, RT | |
Wang et al. [8] | Breast cancer diagnosis | √ | × | × | 12 | SVM | → SVM, NB, NN, WVBE | |
Ali et al. [5] | Disease diagnosis | √ | √ | × | 2 | L1-SVM, L2-SVM | → Adaboost, RF, ET | |
Thongkam et al. [18] | Breast cancer diagnosis | √ | × | × | 2 | RT, ADB | → SVM, RT, ADB | |
Cho and Won [19] | Cancer classification | √ | × | × | 4 | MLP, KNN, SVM, SASOM | → MLP, KNN, SVM, SASOM | |
Chungsoo et al. [6] | The case of death | √ | √ | × | 3 | LLR, GB, XGBoost | → LLR, GB | |
Xiao et al. [21] | Cancer diagnosis | √ | √ | × | 6 | KNN, SVM, DT, RF, GBDT, DNN | → KNN, SVM, DT, RF, GBDT, MV | |
Adem et al. [26] | Cancer diagnosis | √ | √ | × | 6 | KNN, SVM, DT, FFNN, RoF, SC | → KNN, SVM, DT, FFNN, RoF, SC | |
Bashir et al. [27] | Cancer prognosis | √ | × | × | 5 | NB, DTG, DTI, SVM, MBL | → NB, DTG, DTI, SVM, MBL | |
Velusamy and Ramasamy [28] | Disease diagnosis | √ | × | × | 3 | KNN, RF, SVM | → RF, KNN, SVM | |
This study | Cancer prognosis | √ | √ | √ | 5 | BKNN, QSVM, AMLP, XGB, RMLP | → DT, LR, SVM, NB, KNN, BKNN, RF, ADB, XGB, LGB, QSVM, GSVM |
Materials and methods
Data preparation
Data acquisition
Data pre-processing
Priori knowledge- and stability-based feature selection
N | Feature name | Score | N | Feature name | Score |
---|---|---|---|---|---|
1 | EOD_E | 1 | 13 | Race_C | 0.32 |
2 | YoB | 0.78 | 14 | Seer_rKentucky—2000 + | 0.31 |
3 | Race_Korean (1988 +) | 0.74 | 15 | SEERH_Distant | 0.26 |
4 | AAD | 0.68 | 16 | Race_White | 0.23 |
5 | SEERH_Localized | 0.65 | 17 | SN_2nd | 0.2 |
6 | RnCDS_Surgery performed | 0.59 | 18 | RnCDS_Not recommended | 0.2 |
7 | EOD_N | 0.54 | 19 | SS2000_Distant | 0.2 |
8 | MSAD_Married | 0.52 | 20 | Seer_registry_Greater Georgia—2000+ | 0.1 |
9 | EOD_S | 0.47 | 21 | Race_Filipino | 0.07 |
10 | SS2000_Localized | 0.42 | 22 | FMPI_Yes | 0.07 |
11 | NHIA_South or CAeB | 0.41 | 23 | Seer_registry_Los Angeles—1992+ | 0.06 |
12 | NHIA_NSHL | 0.34 | 24 | FMPI_No | 0.04 |
N | Feature name | Score | N | Feature name | Score |
---|---|---|---|---|---|
1 | YoB | 0.9 | 12 | SEX_Female | 0.2 |
2 | GRADE_PD; GIII | 0.75 | 13 | YoD | 0.19 |
3 | FMPI_Yes | 0.51 | 14 | MSaD_Widowed | 0.12 |
4 | FMPI_No | 0.51 | 15 | PSL_C44.6-Skin of upper limb | 0.1 |
5 | GRADE_ IV | 0.47 | 16 | GRADE_Well differentiated; Grade I | 0.09 |
6 | AAD | 0.45 | 17 | SN_One primary only | 0.07 |
7 | SN_2nd of 2 or more primaries | 0.27 | 18 | MoD_January | 0.07 |
8 | MSaD_Married (including common law) | 0.24 | 19 | EXTENT | 0.07 |
9 | SEERH_Distant | 0.23 | 20 | RN | 0.06 |
10 | GRADE_ II | 0.23 | 21 | NHIA _Non-Spanish-Hispanic-Latino | 0.04 |
11 | SEX_Male | 0.21 | 22 | PSL_C44.4-Skin of scalp and neck | 0.04 |
A two-stage heterogeneous stacked ensemble learning method
Base learners pool in the first stage
Meta learner in the second stage
Evaluation indicator
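The indicators reported in the result tables (accuracy, recall, precision, F1-score, AUC, MCC and Cohen's kappa) have standard definitions for binary classification. The sketch below assumes those standard definitions, since this section does not spell out the paper's own formulas; the function names and the confusion-matrix counts are hypothetical and chosen only for illustration.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Threshold-based indicators from a binary confusion matrix."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    # Matthews correlation coefficient
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "accuracy": accuracy, "recall": recall, "precision": precision,
        "f1": f1, "mcc": mcc, "kappa": kappa,
    }


def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts, not taken from the paper's experiments.
m = binary_metrics(tp=81, fp=16, fn=19, tn=84)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
print("auc:", auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Unlike the other indicators, AUC cannot be recovered from a single confusion matrix; it needs the continuous scores that the classifier assigns to each sample.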
Numerical results
Comparison of different feature selection methods
Dataset | Indicator | WF | IG | GA | RF | HFS | PKSFS |
---|---|---|---|---|---|---|---|
Gastric cancer | Accuracy | 0.8139 | 0.8142 | 0.7690 | 0.8090 | 0.7725 | 0.8209 |
Gastric cancer | Recall | 0.7891 | 0.7919 | 0.7771 | 0.8143 | 0.7665 | 0.8100 |
Gastric cancer | Precision | 0.8332 | 0.8217 | 0.7704 | 0.8204 | 0.7856 | 0.8352 |
Gastric cancer | AUC | 0.8173 | 0.8133 | 0.7712 | 0.8112 | 0.7736 | 0.8203 |
Gastric cancer | Number of features | 125 | 27 | 19 | 63 | 15 | 24 |
Skin cancer | Accuracy | 0.8201 | 0.8233 | 0.8017 | 0.7984 | 0.8111 | 0.8336 |
Skin cancer | Recall | 0.8611 | 0.8793 | 0.8336 | 0.8432 | 0.8474 | 0.8910 |
Skin cancer | Precision | 0.8460 | 0.8223 | 0.8105 | 0.8112 | 0.8300 | 0.8332 |
Skin cancer | AUC | 0.8173 | 0.8127 | 0.7977 | 0.8041 | 0.8053 | 0.8214 |
Skin cancer | Number of features | 114 | 27 | 17 | 57 | 16 | 22 |
Comparison of different ensemble mechanisms
Dataset | Indicator | S | H | A | I | BQAXR | D_SP | D_HP | D_AP | D_IP |
---|---|---|---|---|---|---|---|---|---|---|
Gastric cancer | Accuracy | 0.8039 | 0.8063 | 0.7811 | 0.7910 | 0.8209 | 0.0170 | 0.0146 | 0.0398 | 0.0299 |
Gastric cancer | Recall | 0.8110 | 0.7919 | 0.8672 | 0.7252 | 0.8100 | −0.0010 | 0.0181 | −0.0572 | 0.0848 |
Gastric cancer | Precision | 0.8087 | 0.8325 | 0.7477 | 0.8466 | 0.8352 | 0.0265 | 0.0027 | 0.0875 | −0.0114 |
Gastric cancer | F1-score | 0.8099 | 0.8117 | 0.8030 | 0.7812 | 0.8224 | 0.0125 | 0.0107 | 0.0194 | 0.0412 |
Gastric cancer | AUC | 0.8040 | 0.8120 | 0.7754 | 0.7930 | 0.8203 | 0.0163 | 0.0083 | 0.0449 | 0.0273 |
Skin cancer | Accuracy | 0.8232 | 0.8124 | 0.8030 | 0.8214 | 0.8336 | 0.0104 | 0.0212 | 0.0306 | 0.0122 |
Skin cancer | Recall | 0.8664 | 0.8659 | 0.8910 | 0.8173 | 0.8910 | 0.0246 | 0.0251 | 0.0000 | 0.0737 |
Skin cancer | Precision | 0.8291 | 0.8217 | 0.7932 | 0.8733 | 0.8332 | 0.0041 | 0.0115 | 0.0400 | −0.0401 |
Skin cancer | F1-score | 0.8473 | 0.8432 | 0.8393 | 0.8444 | 0.8611 | 0.0138 | 0.0179 | 0.0219 | 0.0168 |
Skin cancer | AUC | 0.8152 | 0.8093 | 0.7874 | 0.8214 | 0.8214 | 0.0089 | 0.0148 | 0.0367 | 0.0028 |
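The `D_` columns are not defined in this section. For the gastric cancer rows, the reported values are consistent with reading each one as the BQAXR score minus the score of the corresponding mechanism (for example, `D_SP` as BQAXR minus column `S`), and the F1-score is consistent with the harmonic mean of precision and recall. A small check under that assumed reading, using figures copied from the table:

```python
# Gastric cancer rows of the table above (values copied as reported).
bqaxr = {"Accuracy": 0.8209, "Recall": 0.8100, "Precision": 0.8352}
s_col = {"Accuracy": 0.8039, "Recall": 0.8110, "Precision": 0.8087}

# Assumed reading: D_SP = BQAXR - S, rounded to four decimals.
d_sp = {k: round(bqaxr[k] - s_col[k], 4) for k in bqaxr}
print(d_sp)

# F1-score as the harmonic mean of precision and recall.
f1 = 2 * bqaxr["Precision"] * bqaxr["Recall"] / (bqaxr["Precision"] + bqaxr["Recall"])
print(round(f1, 4))
```

Under this reading, a negative entry marks an indicator on which the alternative mechanism scores higher than BQAXR.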
Comparison of different stacked strategies
Dataset | First stage | Second stage | Accuracy | Recall | Precision | F1-score | AUC |
---|---|---|---|---|---|---|---|
Gastric cancer | KNN + SVM + MLP + XGB (unimproved) | LR | 0.7897 | 0.8000 | 0.7933 | 0.7967 | 0.7893 |
Gastric cancer | BKNN + QSVM + AMLP + XGB (improved) | LR | 0.8130 | 0.8080 | 0.8240 | 0.8160 | 0.8130 |
Gastric cancer | KNN + SVM + MLP + XGB (unimproved) | SVM | 0.7868 | 0.7889 | 0.7955 | 0.7922 | 0.7868 |
Gastric cancer | BKNN + QSVM + AMLP + XGB (improved) | SVM | 0.8060 | 0.7940 | 0.8240 | 0.8090 | 0.8070 |
Gastric cancer | KNN + SVM + MLP + XGB (unimproved) | RMLP | 0.7954 | 0.7972 | 0.8039 | 0.8005 | 0.7954 |
Gastric cancer | BKNN + QSVM + AMLP + XGB (improved) | RMLP | 0.8209 | 0.8100 | 0.8352 | 0.8224 | 0.8203 |
Skin cancer | KNN + SVM + MLP + XGB (unimproved) | LR | 0.7971 | 0.8812 | 0.7911 | 0.8337 | 0.7818 |
Skin cancer | BKNN + QSVM + AMLP + XGB (improved) | LR | 0.8200 | 0.8810 | 0.8200 | 0.8490 | 0.8090 |
Skin cancer | KNN + SVM + MLP + XGB (unimproved) | SVM | 0.8020 | 0.8713 | 0.8037 | 0.8361 | 0.7904 |
Skin cancer | BKNN + QSVM + AMLP + XGB (improved) | SVM | 0.8140 | 0.8710 | 0.8190 | 0.8440 | 0.8040 |
Skin cancer | KNN + SVM + MLP + XGB (unimproved) | RMLP | 0.8070 | 0.8713 | 0.8008 | 0.8341 | 0.7870 |
Skin cancer | BKNN + QSVM + AMLP + XGB (improved) | RMLP | 0.8336 | 0.8910 | 0.8332 | 0.8611 | 0.8214 |
Comparison between BQAXR and advanced classification methods
Dataset | Type | Model | Accuracy | Recall | Precision | F1-score | AUC |
---|---|---|---|---|---|---|---|
Gastric cancer | Single classifier | DT | 0.7123 | 0.5722 | 0.7132 | 0.6350 | 0.7281 |
Gastric cancer | Single classifier | LR | 0.7706 | 0.7439 | 0.7968 | 0.7694 | 0.7703 |
Gastric cancer | Single classifier | SVM | 0.7711 | 0.7393 | 0.7917 | 0.7646 | 0.7709 |
Gastric cancer | Single classifier | NB | 0.7534 | 0.6780 | 0.8110 | 0.7386 | 0.7550 |
Gastric cancer | Single classifier | KNN | 0.7551 | 0.7440 | 0.7952 | 0.7687 | 0.7556 |
Gastric cancer | Ensemble classifier | RF | 0.7827 | 0.7781 | 0.7954 | 0.7867 | 0.7830 |
Gastric cancer | Ensemble classifier | Adaboost | 0.7738 | 0.7494 | 0.7893 | 0.7688 | 0.7744 |
Gastric cancer | Ensemble classifier | XGBoost | 0.8020 | 0.8021 | 0.8147 | 0.8084 | 0.8031 |
Gastric cancer | Ensemble classifier | LightGBM | 0.7867 | 0.7640 | 0.7746 | 0.7693 | 0.7881 |
Gastric cancer | Improved classifier | QSVM | 0.7971 | 0.7643 | 0.8276 | 0.7947 | 0.7982 |
Gastric cancer | Improved classifier | GSVM | 0.7892 | 0.7705 | 0.8151 | 0.7922 | 0.7888 |
Gastric cancer | Improved classifier | BKNN | 0.7900 | 0.7941 | 0.8022 | 0.7981 | 0.7979 |
Gastric cancer | Proposed classifier | BQAXR | 0.8209 | 0.8100 | 0.8352 | 0.8224 | 0.8203 |
Skin cancer | Single classifier | DT | 0.7905 | 0.8216 | 0.8184 | 0.8200 | 0.7911 |
Skin cancer | Single classifier | LR | 0.8053 | 0.8166 | 0.8406 | 0.8284 | 0.8040 |
Skin cancer | Single classifier | SVM | 0.8062 | 0.8463 | 0.8222 | 0.8341 | 0.7976 |
Skin cancer | Single classifier | NB | 0.8000 | 0.8660 | 0.8031 | 0.8334 | 0.7878 |
Skin cancer | Single classifier | KNN | 0.7804 | 0.7180 | 0.8257 | 0.7681 | 0.7910 |
Skin cancer | Ensemble classifier | RF | 0.8160 | 0.8564 | 0.8124 | 0.8338 | 0.7984 |
Skin cancer | Ensemble classifier | Adaboost | 0.8094 | 0.8513 | 0.8233 | 0.8371 | 0.8011 |
Skin cancer | Ensemble classifier | XGBoost | 0.8026 | 0.8322 | 0.8282 | 0.8302 | 0.7983 |
Skin cancer | Ensemble classifier | LightGBM | 0.8029 | 0.8456 | 0.8182 | 0.8317 | 0.7989 |
Skin cancer | Improved classifier | QSVM | 0.8172 | 0.8610 | 0.8290 | 0.8447 | 0.8090 |
Skin cancer | Improved classifier | GSVM | 0.7987 | 0.8431 | 0.8165 | 0.8296 | 0.7973 |
Skin cancer | Improved classifier | BKNN | 0.8108 | 0.8661 | 0.8181 | 0.8414 | 0.8014 |
Skin cancer | Proposed classifier | BQAXR | 0.8336 | 0.8910 | 0.8332 | 0.8611 | 0.8214 |
Classifier | Gastric cancer: MCC | Gastric cancer: Cohen’s kappa | Skin cancer: MCC | Skin cancer: Cohen’s kappa |
---|---|---|---|---|
DT | 0.4795 | 0.4770 | 0.6074 | 0.6074 |
KNN | 0.5560 | 0.4990 | 0.5789 | 0.5713 |
RF | 0.5906 | 0.5906 | 0.5883 | 0.5880 |
LR | 0.5589 | 0.5588 | 0.6168 | 0.6140 |
Adaboost | 0.5847 | 0.5846 | 0.6029 | 0.6060 |
XGBoost | 0.5908 | 0.5907 | 0.5818 | 0.5810 |
SVM | 0.5851 | 0.5850 | 0.5929 | 0.5883 |
NB | 0.5144 | 0.5071 | 0.5767 | 0.5638 |
BKNN | 0.5873 | 0.5894 | 0.5996 | 0.6045 |
QSVM | 0.5962 | 0.5943 | 0.6287 | 0.6268 |
GSVM | 0.5884 | 0.5901 | 0.624 | 0.6213 |
LightGBM | 0.6051 | 0.6050 | 0.6237 | 0.6233 |
BQAXR | 0.6200 | 0.6220 | 0.6701 | 0.6721 |
Classifier | Indicator | Gastric cancer: BKNN | Gastric cancer: SVM | Gastric cancer: QSVM | Gastric cancer: AMLP | Gastric cancer: XGBoost | Gastric cancer: BQAXR | Skin cancer: BKNN | Skin cancer: SVM | Skin cancer: QSVM | Skin cancer: AMLP | Skin cancer: XGBoost | Skin cancer: BQAXR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
KNN | Accuracy | 0.034 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.013 | 0 |
KNN | Recall | 0 | 0 | 0 | 0 | 0 | 0 | 0.023 | 0.010 | 0 | 0.144 | 0 | 0 |
KNN | Precision | 0 | 0.021 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.004 |
KNN | AUC | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.006 | 0 | 0 | 0.441 | 0 |
BKNN | Accuracy | – | 0.121 | 0 | 0.001 | 0.001 | 0 | – | 0 | 0 | 0 | 0 | 0 |
BKNN | Recall | – | 0 | 0.016 | 0.001 | 0.001 | 0 | – | 0 | 0.001 | 0.017 | 0 | 0 |
BKNN | Precision | – | 0.213 | 0.001 | 0 | 0 | 0 | – | 0.375 | 0.001 | 0 | 0.316 | 0.001 |
BKNN | AUC | – | 0 | 0 | 0.001 | 0 | 0 | – | 0.131 | 0.028 | 0 | 0 | 0 |
SVM | Accuracy | – | – | 0.004 | 0.014 | 0 | 0 | – | – | 0.010 | 0 | 0 | 0 |
SVM | Recall | – | – | 0.012 | 0 | 0 | 0.003 | – | – | 0.023 | 0.323 | 0 | 0.001 |
SVM | Precision | – | – | 0 | 0 | 0 | 0 | – | – | 0.011 | 0.023 | 0.4 | 0 |
SVM | AUC | – | – | 0 | 0.422 | 0 | 0 | – | – | 0 | 0 | 0.004 | 0 |
QSVM | Accuracy | – | – | – | 0.012 | 0.032 | 0 | – | – | – | 0.002 | 0.017 | 0 |
QSVM | Recall | – | – | – | 0 | 0.012 | 0 | – | – | – | 0 | 0.023 | 0 |
QSVM | Precision | – | – | – | 0.004 | 0.033 | 0.01 | – | – | – | 0.038 | 0.032 | 0.001 |
QSVM | AUC | – | – | – | 0.743 | 0.042 | 0.026 | – | – | – | 0.376 | 0.042 | 0.001 |
AMLP | Accuracy | – | – | – | – | 0.02 | 0 | – | – | – | – | 0.007 | 0 |
AMLP | Recall | – | – | – | – | 0.001 | 0 | – | – | – | – | 0 | 0 |
AMLP | Precision | – | – | – | – | 0.012 | 0.001 | – | – | – | – | 0 | 0.034 |
AMLP | AUC | – | – | – | – | 0.042 | 0.03 | – | – | – | – | 0.032 | 0.015 |
XGBoost | Accuracy | – | – | – | – | – | 0.001 | – | – | – | – | – | 0 |
XGBoost | Recall | – | – | – | – | – | 0 | – | – | – | – | – | 0 |
XGBoost | Precision | – | – | – | – | – | 0 | – | – | – | – | – | 0 |
XGBoost | AUC | – | – | – | – | – | 0.002 | – | – | – | – | – | 0.007 |
Discussion
Type | References | Structure of the proposed model | Gastric cancer: Accuracy | Gastric cancer: AUC | Skin cancer: Accuracy | Skin cancer: AUC |
---|---|---|---|---|---|---|
Heterogeneous ensemble | [27] | NB + DTG + DTI + SVM + MBL | 0.7560 | 0.7514 | 0.7753 | 0.7832 |
Heterogeneous ensemble | [28] | KNN + RF + SVM | 0.7848 | 0.7819 | 0.7905 | 0.7804 |
Heterogeneous ensemble | [18] | Adaboost + RF | 0.7802 | 0.7803 | 0.7709 | 0.7603 |
Homogeneous ensemble | [8] | Ensemble SVM | 0.7881 | 0.7879 | 0.8013 | 0.8000 |
Homogeneous ensemble | [43] | LightGBM | 0.787 | 0.788 | 0.8032 | 0.7963 |
Homogeneous ensemble | [44] | XGBoost | 0.8020 | 0.8031 | 0.8025 | 0.8002 |
Stacked ensemble | [6] | First stage: LLR + GB; Second stage: XGBoost | 0.7984 | 0.7989 | 0.8225 | 0.8140 |
Stacked ensemble | [5] | Stacking SVM | 0.7883 | 0.7869 | 0.8143 | 0.8018 |
Stacked ensemble | [21] | First stage: KNN + DT + RF + GBDT + SVM; Second stage: MLP | 0.7885 | 0.7901 | 0.7900 | 0.7971 |
Stacked ensemble | This paper | BQAXR | 0.8209 | 0.8203 | 0.8336 | 0.8214 |