Published in: Soft Computing 23/2020

22-06-2020 | Methodologies and Application

Genetic programming for high-dimensional imbalanced classification with a new fitness function and program reuse mechanism

Authors: Wenbin Pei, Bing Xue, Lin Shang, Mengjie Zhang

Abstract

Genetic programming (GP) has been successfully applied to classification. However, GP may evolve biased classifiers when the training data are class-imbalanced, and such classifiers are often too unreliable for real-world applications. High dimensionality makes it even harder for classifiers to separate the majority class from the minority class. The use of GP to handle the joint effect of high dimensionality and class imbalance has not been heavily investigated. In this paper, we propose a GP approach to high-dimensional imbalanced classification, with the goals of increasing classification performance and saving training time. To achieve these goals, a new fitness function is developed to address class imbalance, and a strategy is proposed to reuse previously evolved good GP individuals to improve efficiency. The proposed method is examined on ten high-dimensional imbalanced datasets. Experimental results show that, for high-dimensional imbalanced classification, the proposed method generally outperforms other GP methods and traditional classification algorithms that use sampling methods to address class imbalance.
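The abstract does not give the form of the new fitness function, so the following is only a minimal sketch of why a fitness measure matters under class imbalance. It compares plain accuracy with the mean of the per-class accuracies, a commonly used imbalance-aware measure that is swapped in here purely for illustration and is not claimed to be the paper's function. The names `accuracy`, `balanced_accuracy`, and the toy labels are hypothetical.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Overall accuracy: fraction of instances classified correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int],
                      minority: int = 1, majority: int = 0) -> float:
    """Mean of the minority-class and majority-class accuracies.

    Illustrative stand-in only; NOT the fitness function from the paper.
    """
    n_min = sum(1 for t in y_true if t == minority)
    n_maj = sum(1 for t in y_true if t == majority)
    if n_min == 0 or n_maj == 0:
        raise ValueError("both classes must be present in y_true")
    hit_min = sum(1 for t, p in zip(y_true, y_pred)
                  if t == minority and p == minority)
    hit_maj = sum(1 for t, p in zip(y_true, y_pred)
                  if t == majority and p == majority)
    return 0.5 * (hit_min / n_min + hit_maj / n_maj)


# Toy imbalanced data (assumed): 95 majority instances, 5 minority instances.
y_true = [0] * 95 + [1] * 5
# A biased classifier that always predicts the majority class.
always_majority = [0] * 100

acc = accuracy(y_true, always_majority)
bal = balanced_accuracy(y_true, always_majority)
print(f"accuracy          = {acc:.2f}")
print(f"balanced accuracy = {bal:.2f}")
```

Under plain accuracy the always-majority classifier looks strong even though it never detects the minority class, which is the kind of bias the abstract describes. A measure that weights both classes equally scores it no better than chance.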

Metadata
Title: Genetic programming for high-dimensional imbalanced classification with a new fitness function and program reuse mechanism
Authors: Wenbin Pei, Bing Xue, Lin Shang, Mengjie Zhang
Publication date: 22-06-2020
Publisher: Springer Berlin Heidelberg
Published in: Soft Computing, Issue 23/2020
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI: https://doi.org/10.1007/s00500-020-05056-7
