Published in: International Journal of Machine Learning and Cybernetics 7/2019

09.07.2018 | Original Article

Research on classification method of high-dimensional class-imbalanced datasets based on SVM

Authors: Chunkai Zhang, Ying Zhou, Jianwei Guo, Guoquan Wang, Xuan Wang

Abstract

High-dimensional data degrade classification because some combinations of features have an adverse effect on the classifier, while class imbalance causes the classifier to favor the majority class over the minority class, because majority-class samples outnumber minority-class samples. Data that are both high-dimensional and class-imbalanced arise in many fields, such as bioinformatics and healthcare. Many researchers have studied either the high-dimensional problem or the class-imbalanced problem and have proposed a series of algorithms, but they overlook the new problem that arises when the two occur together: high dimensionality affects the sampling process, while class imbalance interferes with feature selection. This paper first analyses the new problem arising from the mutual influence of the two problems, then introduces SVM and analyses its advantages in dealing with high-dimensional and class-imbalanced problems. Next, it proposes a new algorithm named BRFE-PBKS-SVM, aimed at high-dimensional class-imbalanced datasets. The algorithm improves SVM-RFE by taking class imbalance into account during feature selection, and it improves SMOTE so that over-sampling can work in the Hilbert space with an adaptive over-sampling rate determined by PSO. Finally, experimental results show the performance of the algorithm.
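
For orientation, the over-sampling idea that the abstract builds on can be sketched as follows. This is a minimal illustration of plain SMOTE in the original input space (interpolating between a minority sample and one of its nearest minority neighbours); it is not the authors' BRFE-PBKS-SVM, which according to the abstract over-samples in the Hilbert space and tunes the over-sampling rate with PSO. The function name `smote_oversample`, its parameters, and the toy data are assumptions made for this example only.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by neighbour interpolation.

    Hypothetical helper for illustration; plain input-space SMOTE only.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` (Euclidean), excluding itself
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        # the new point lies on the segment between base and neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features (made-up values)
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_oversample(minority, n_new=6)
```

Here `n_new` plays the role of a fixed over-sampling rate; in the approach the abstract describes, that rate is chosen adaptively rather than set by hand.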

Metadata
Title
Research on classification method of high-dimensional class-imbalanced datasets based on SVM
Authors
Chunkai Zhang
Ying Zhou
Jianwei Guo
Guoquan Wang
Xuan Wang
Publication date
09.07.2018
Publisher
Springer Berlin Heidelberg
Published in
International Journal of Machine Learning and Cybernetics / Issue 7/2019
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-018-0853-2
