Skip to main content
Erschienen in: Evolutionary Intelligence 3/2016

01.09.2016 | Special Issue

Improving performance for classification with incomplete data using wrapper-based feature selection

verfasst von: Cao Truong Tran, Mengjie Zhang, Peter Andreae, Bing Xue

Erschienen in: Evolutionary Intelligence | Ausgabe 3/2016

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Missing values are an unavoidable problem of many real-world datasets. Inadequate treatment of missing values may result in large errors on classification; thus, dealing well with missing values is essential for classification. Feature selection has been well known for improving classification, but it has been seldom used for improving classification with incomplete datasets. Moreover, some classifiers such as C4.5 are able to directly classify incomplete datasets, but they often generate more complex classifiers with larger classification errors. The purpose of this paper is to propose a wrapper-based feature selection method to improve the ability of a classifier able to classify incomplete datasets. In order to achieve the purpose, the feature selection method evaluates feature subsets using a classifier able to classify incomplete datasets. Empirical results on 14 datasets using particle swarm optimisation for searching feature subsets and C4.5 for evaluating the feature subsets in the feature selection method show that the wrapper-based feature selection is not only able to improve classification accuracy of the classifier, but also able to reduce the size of trees generated by the classifier.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
2.
Zurück zum Zitat Barnard J, Meng X-L (1999) Applications of multiple imputation in medical studies: from aids to nhanes. Stat Methods Med Res 8:17–36CrossRef Barnard J, Meng X-L (1999) Applications of multiple imputation in medical studies: from aids to nhanes. Stat Methods Med Res 8:17–36CrossRef
3.
Zurück zum Zitat Batista GE, Monard MC (2002) A study of K-nearest neighbour as an imputation method. HIS 87:251–260 Batista GE, Monard MC (2002) A study of K-nearest neighbour as an imputation method. HIS 87:251–260
4.
Zurück zum Zitat Berger JO (2013) Statistical decision theory and Bayesian analysis. Springer, New York Berger JO (2013) Statistical decision theory and Bayesian analysis. Springer, New York
5.
Zurück zum Zitat Bishop CM (2006) Pattern recognition and machine learning. Springer, New YorkMATH Bishop CM (2006) Pattern recognition and machine learning. Springer, New YorkMATH
6.
Zurück zum Zitat Chuang L-Y, Chang H-W, Tu C-J, Yang C-H (2008) Improved binary pso for feature selection using gene expression data. Comput Biol Chem 32:29–38CrossRefMATH Chuang L-Y, Chang H-W, Tu C-J, Yang C-H (2008) Improved binary pso for feature selection using gene expression data. Comput Biol Chem 32:29–38CrossRefMATH
7.
Zurück zum Zitat Clark P, Niblett T (1989) The CN2 induction algorithm. Mach Learn 3:261–283 Clark P, Niblett T (1989) The CN2 induction algorithm. Mach Learn 3:261–283
8.
Zurück zum Zitat Clerc M, Kennedy J (2002) The particle swarm-explosion, stability, and convergence in a multidimensional complex space. IEEE Trans Evol Comput 6:58–73CrossRef Clerc M, Kennedy J (2002) The particle swarm-explosion, stability, and convergence in a multidimensional complex space. IEEE Trans Evol Comput 6:58–73CrossRef
9.
Zurück zum Zitat Dash M, Liu H (1997) Feature selection for classification. Intell Data Anal 1:131–156CrossRef Dash M, Liu H (1997) Feature selection for classification. Intell Data Anal 1:131–156CrossRef
10.
Zurück zum Zitat De’ath G, Fabricius KE (2000) Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology 81:3178–3192CrossRef De’ath G, Fabricius KE (2000) Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology 81:3178–3192CrossRef
11.
Zurück zum Zitat Doquire G, Verleysen M (2012) Feature selection with missing data using mutual information estimators. Neurocomputing 90:3–11CrossRef Doquire G, Verleysen M (2012) Feature selection with missing data using mutual information estimators. Neurocomputing 90:3–11CrossRef
12.
Zurück zum Zitat Farhangfar A, Kurgan L, Dy J (2008) Impact of imputation of missing values on classification error for discrete data. Pattern Recogn 41:3692–3705CrossRefMATH Farhangfar A, Kurgan L, Dy J (2008) Impact of imputation of missing values on classification error for discrete data. Pattern Recogn 41:3692–3705CrossRefMATH
13.
Zurück zum Zitat Farhangfar A, Kurgan LA, Pedrycz W (2007) A novel framework for imputation of missing values in databases. IEEE Trans Syst Man Cybern A Syst Hum 37:692–709CrossRef Farhangfar A, Kurgan LA, Pedrycz W (2007) A novel framework for imputation of missing values in databases. IEEE Trans Syst Man Cybern A Syst Hum 37:692–709CrossRef
14.
Zurück zum Zitat García S, Molina D, Lozano M, Herrera F (2009) A study on the use of non-parametric tests for analyzing the evolutionary algorithms behaviour: a case study on the cec2005 special session on real parameter optimization. J Heuristics 15:617–644CrossRefMATH García S, Molina D, Lozano M, Herrera F (2009) A study on the use of non-parametric tests for analyzing the evolutionary algorithms behaviour: a case study on the cec2005 special session on real parameter optimization. J Heuristics 15:617–644CrossRefMATH
15.
Zurück zum Zitat García-Laencina PJ, Sancho-Gómez J-L, Figueiras-Vidal AR (2010) Pattern classification with missing data: a review. Neural Comput Appl 19:263–282CrossRef García-Laencina PJ, Sancho-Gómez J-L, Figueiras-Vidal AR (2010) Pattern classification with missing data: a review. Neural Comput Appl 19:263–282CrossRef
16.
Zurück zum Zitat Graham JW (2009) Missing data analysis: making it work in the real world. Annu Rev Psychol 60:549–576CrossRef Graham JW (2009) Missing data analysis: making it work in the real world. Annu Rev Psychol 60:549–576CrossRef
17.
Zurück zum Zitat Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182MATH Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182MATH
18.
Zurück zum Zitat Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. ACM SIGKDD Explor Newsl 11:10–18CrossRef Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. ACM SIGKDD Explor Newsl 11:10–18CrossRef
19.
Zurück zum Zitat Hall MA (1999) Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato Hall MA (1999) Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato
20.
Zurück zum Zitat Han J, Kamber M, Pei J (2006) Data mining, southeast asia edition: concepts and techniques. Morgan kaufmann, San FranciscoMATH Han J, Kamber M, Pei J (2006) Data mining, southeast asia edition: concepts and techniques. Morgan kaufmann, San FranciscoMATH
21.
Zurück zum Zitat Huang C-L, Dun J-F (2008) A distributed pso-svm hybrid system with feature selection and parameter optimization. Appl Soft Comput 8:1381–1391CrossRef Huang C-L, Dun J-F (2008) A distributed pso-svm hybrid system with feature selection and parameter optimization. Appl Soft Comput 8:1381–1391CrossRef
22.
Zurück zum Zitat Jain A, Zongker D (1997) Feature selection: evaluation, application, and small sample performance. IEEE Trans Pattern Anal Mach Intell 19:153–158CrossRef Jain A, Zongker D (1997) Feature selection: evaluation, application, and small sample performance. IEEE Trans Pattern Anal Mach Intell 19:153–158CrossRef
23.
Zurück zum Zitat Kennedy J (2010) Particle swarm optimization. In: Encyclopedia of machine learning, pp 760–766 Kennedy J (2010) Particle swarm optimization. In: Encyclopedia of machine learning, pp 760–766
24.
Zurück zum Zitat Kennedy J, Kennedy JF, Eberhart RC (2001) Swarm intelligence. Morgan Kaufmann, San Francisco Kennedy J, Kennedy JF, Eberhart RC (2001) Swarm intelligence. Morgan Kaufmann, San Francisco
25.
Zurück zum Zitat Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97:273–324CrossRefMATH Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97:273–324CrossRefMATH
26.
Zurück zum Zitat Koller D, Sahami M (1995) Toward optimal feature selection. In: 13th international conference on machine learning, pp 284–292 Koller D, Sahami M (1995) Toward optimal feature selection. In: 13th international conference on machine learning, pp 284–292
27.
Zurück zum Zitat Lane MC, Xue B, Liu I, Zhang M (2014) Gaussian based particle swarm optimisation and statistical clustering for feature selection. In: European conference on evolutionary computation in combinatorial optimization, pp 133–144 Lane MC, Xue B, Liu I, Zhang M (2014) Gaussian based particle swarm optimisation and statistical clustering for feature selection. In: European conference on evolutionary computation in combinatorial optimization, pp 133–144
28.
Zurück zum Zitat Lin S-W, Ying K-C, Chen S-C, Lee Z-J (2008) Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Syst Appl 35:1817–1824CrossRef Lin S-W, Ying K-C, Chen S-C, Lee Z-J (2008) Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Syst Appl 35:1817–1824CrossRef
29.
Zurück zum Zitat Little RJ, Rubin DB (2014) Statistical analysis with missing data. Wiley, HobokenMATH Little RJ, Rubin DB (2014) Statistical analysis with missing data. Wiley, HobokenMATH
30.
Zurück zum Zitat MacKay DJ (2003) Information theory, inference, and learning algorithms, vol 7. Citeseer MacKay DJ (2003) Information theory, inference, and learning algorithms, vol 7. Citeseer
31.
Zurück zum Zitat Oh I-S, Lee J-S, Moon B-R (2004) Hybrid genetic algorithms for feature selection. IEEE Trans Pattern Anal Mach Intell 26:1424–1437CrossRef Oh I-S, Lee J-S, Moon B-R (2004) Hybrid genetic algorithms for feature selection. IEEE Trans Pattern Anal Mach Intell 26:1424–1437CrossRef
32.
Zurück zum Zitat Qian W, Shu W (2015) Mutual information criterion for feature selection from incomplete data. Neurocomputing 168:210–220CrossRef Qian W, Shu W (2015) Mutual information criterion for feature selection from incomplete data. Neurocomputing 168:210–220CrossRef
33.
Zurück zum Zitat Quinlan JR (2014) C4. 5: programs for machine learning. Elsevier, Amsterdam Quinlan JR (2014) C4. 5: programs for machine learning. Elsevier, Amsterdam
34.
35.
Zurück zum Zitat Schafer JL, Graham JW (2002) Missing data: our view of the state of the art. Psychol Methods 7:147CrossRef Schafer JL, Graham JW (2002) Missing data: our view of the state of the art. Psychol Methods 7:147CrossRef
36.
Zurück zum Zitat Tran CT, Andreae P, Zhang M (2015) Impact of imputation of missing values on genetic programming based multiple feature construction for classification. In: 2015 IEEE congress on evolutionary computation (CEC), pp 2398–2405 Tran CT, Andreae P, Zhang M (2015) Impact of imputation of missing values on genetic programming based multiple feature construction for classification. In: 2015 IEEE congress on evolutionary computation (CEC), pp 2398–2405
37.
Zurück zum Zitat Tran CT, Zhang M, Andreae P (2015) Multiple imputation for missing data using genetic programming. In: Proceedings of the 2015 annual conference on genetic and evolutionary computation, pp 583–590 Tran CT, Zhang M, Andreae P (2015) Multiple imputation for missing data using genetic programming. In: Proceedings of the 2015 annual conference on genetic and evolutionary computation, pp 583–590
38.
Zurück zum Zitat Tran CT, Zhang M, Andreae P (2016) A genetic programming-based imputation method for classification with missing data. In: European conference on genetic programming. Springer, pp 149–163 Tran CT, Zhang M, Andreae P (2016) A genetic programming-based imputation method for classification with missing data. In: European conference on genetic programming. Springer, pp 149–163
39.
Zurück zum Zitat Tran CT, Zhang M, Andreae P, Xue B (2016) A wrapper feature selection approach to classification with missing data. In: Applications of evolutionary computation, pp 685–700 Tran CT, Zhang M, Andreae P, Xue B (2016) A wrapper feature selection approach to classification with missing data. In: Applications of evolutionary computation, pp 685–700
40.
Zurück zum Zitat Xue B, Zhang M, Browne W, Yao X (2015) A survey on evolutionary computation approaches to feature selection. IEEE Trans Evol Comput 99:1 Xue B, Zhang M, Browne W, Yao X (2015) A survey on evolutionary computation approaches to feature selection. IEEE Trans Evol Comput 99:1
41.
Zurück zum Zitat Xue B, Zhang M, Browne WN (2012) Single feature ranking and binary particle swarm optimisation based feature subset ranking for feature selection. In: Proceedings of the thirty-fifth Australasian computer science conference, vol 122, pp 27–36 Xue B, Zhang M, Browne WN (2012) Single feature ranking and binary particle swarm optimisation based feature subset ranking for feature selection. In: Proceedings of the thirty-fifth Australasian computer science conference, vol 122, pp 27–36
42.
Zurück zum Zitat Xue B, Zhang M, Browne WN (2013) Particle swarm optimization for feature selection in classification: a multi-objective approach. IEEE Trans Cybern 43:1656–1671CrossRef Xue B, Zhang M, Browne WN (2013) Particle swarm optimization for feature selection in classification: a multi-objective approach. IEEE Trans Cybern 43:1656–1671CrossRef
43.
Zurück zum Zitat Xue B, Zhang M, Browne WN (2015) A comprehensive comparison on evolutionary feature selection approaches to classification. Int J Comput Intell Appl 14:1550008CrossRef Xue B, Zhang M, Browne WN (2015) A comprehensive comparison on evolutionary feature selection approaches to classification. Int J Comput Intell Appl 14:1550008CrossRef
44.
Zurück zum Zitat Yang J, Honavar V (1998) Feature subset selection using a genetic algorithm. In: Feature extraction, construction and selection, pp 117–136 Yang J, Honavar V (1998) Feature subset selection using a genetic algorithm. In: Feature extraction, construction and selection, pp 117–136
Metadaten
Titel
Improving performance for classification with incomplete data using wrapper-based feature selection
verfasst von
Cao Truong Tran
Mengjie Zhang
Peter Andreae
Bing Xue
Publikationsdatum
01.09.2016
Verlag
Springer Berlin Heidelberg
Erschienen in
Evolutionary Intelligence / Ausgabe 3/2016
Print ISSN: 1864-5909
Elektronische ISSN: 1864-5917
DOI
https://doi.org/10.1007/s12065-016-0141-6

Weitere Artikel der Ausgabe 3/2016

Evolutionary Intelligence 3/2016 Zur Ausgabe

Premium Partner