2016 | OriginalPaper | Chapter

A Genetic Programming-Based Imputation Method for Classification with Missing Data

Authors: Cao Truong Tran, Mengjie Zhang, Peter Andreae

Published in: Genetic Programming

Publisher: Springer International Publishing

Abstract

Many industrial and real-world datasets suffer from the unavoidable problem of missing values. The ability to deal with missing values is an essential requirement for classification, because inadequate treatment of missing values may lead to large classification errors. The problem of missing data has been addressed extensively in the statistics literature and, to a lesser extent, in the classification literature. One of the most popular approaches to dealing with missing data is to use imputation methods that fill missing values with plausible values. Some powerful imputation methods, such as the regression-based imputations in MICE [36], are often suitable for batch imputation tasks. However, they are often computationally expensive when they must impute missing values for every single incomplete instance in the unseen set at classification time. This paper proposes a genetic programming-based imputation (GPI) method for classification with missing data that uses genetic programming as a regression method to impute missing values. Experiments on six benchmark datasets and five popular classifiers compare GPI with five other popular and advanced regression-based imputation methods in MICE on two measures: classification accuracy and computation time. The results show that, in most cases, GPI achieves classification accuracy at least as good as the other imputation methods, and sometimes significantly better. Moreover, using GPI to impute missing values for every single incomplete instance is dramatically faster than using the other imputation methods.
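The idea of imputing by regression, as described in the abstract, can be illustrated with a minimal sketch. It is not the authors' implementation: ordinary least squares is used here as a stand-in for the regression functions that GPI would evolve with genetic programming, and the names `fit_linear`, `train_imputers` and `impute` are hypothetical. The sketch also assumes numeric features, a non-singular design matrix, and at most one missing value (marked `None`) per instance. What it is meant to show is the split between an expensive training step and a cheap application step: once a function has been learned for a feature, imputing an unseen instance is only a function evaluation.

```python
def fit_linear(X, y):
    """Least-squares fit with intercept via the normal equations.

    Stand-in regressor only: GPI would evolve a function with genetic
    programming instead. Assumes the normal equations are non-singular.
    """
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [A^T A | A^T y].
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(k):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    w = [M[i][k] for i in range(k)]
    return lambda x: w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def train_imputers(train, target_features):
    """Fit one regressor per target feature, using complete rows only."""
    complete = [r for r in train if all(v is not None for v in r)]
    imputers = {}
    for j in target_features:
        X = [[v for i, v in enumerate(r) if i != j] for r in complete]
        y = [r[j] for r in complete]
        imputers[j] = fit_linear(X, y)
    return imputers


def impute(instance, imputers):
    """Fill a single missing value by evaluating the trained function."""
    missing = [i for i, v in enumerate(instance) if v is None]
    # This sketch only handles exactly one missing, trained-for feature.
    if len(missing) != 1 or missing[0] not in imputers:
        return list(instance)
    j = missing[0]
    x = [v for i, v in enumerate(instance) if i != j]
    filled = list(instance)
    filled[j] = imputers[j](x)
    return filled


# Toy data (made up for illustration): feature 2 = 2*f0 + f1 + 1.
train = [
    [1.0, 2.0, 5.0],
    [2.0, 1.0, 6.0],
    [3.0, 3.0, 10.0],
    [0.0, 1.0, 2.0],
    [4.0, 0.0, 9.0],
    [2.0, None, 7.0],  # incomplete row, ignored during training
]
imputers = train_imputers(train, target_features=[2])
# The imputed third value should come out close to 13 (= 2*5 + 2 + 1).
print(impute([5.0, 2.0, None], imputers))
```

Training happens once on the training set; each unseen incomplete instance then costs a single call to `impute`, which is the property the abstract credits for GPI's speed on single-instance imputation.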

Literature
  1. Agapitos, A., Brabazon, A., O’Neill, M.: Controlling overfitting in symbolic regression based on a bias/variance error decomposition. In: Coello, C.A.C., Cutello, V., Deb, K., Forrest, S., Nicosia, G., Pavone, M. (eds.) PPSN 2012, Part I. LNCS, vol. 7491, pp. 438–447. Springer, Heidelberg (2012)
  2. Andridge, R.R., Little, R.J.: A review of hot deck imputation for survey non-response. Int. Stat. Rev. 78, 40–64 (2010)
  3. Asuncion, A., Newman, D.: UCI machine learning repository (2007). http://www.ics.uci.edu/~mlearn/MLRepository.html
  4. Augusto, D.A., Barbosa, H.J.: Symbolic regression via genetic programming. In: Sixth Brazilian Symposium on Neural Networks, 2000, Proceedings, pp. 173–178 (2000)
  5. Barmpalexis, P., Kachrimanis, K., Tsakonas, A., Georgarakis, E.: Symbolic regression via genetic programming in the optimization of a controlled release pharmaceutical formulation. Chemometr. Intell. Lab. Syst. 107, 75–82 (2011)
  6. Barnard, J., Meng, X.L.: Applications of multiple imputation in medical studies: from AIDS to NHANES. Stat. Methods Med. Res. 8, 7–36 (1999)
  7. Berger, J.O.: Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media, New York (2013)
  8. Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. CRC Press, Boca Raton (1984)
  9. Buuren, S., Groothuis-Oudshoorn, K.: MICE: multivariate imputation by chained equations in R. J. Stat. Soft. 45, 1–67 (2011)
  10. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)
  11. Cunningham, P., Delany, S.J.: k-Nearest Neighbour classifiers. In: Multiple Classifier Systems, pp. 1–17 (2007)
  12. Draper, N.R., Smith, H., Pownell, E.: Applied Regression Analysis, vol. 3. Wiley, New York (1966)
  13. Farhangfar, A., Kurgan, L., Dy, J.: Impact of imputation of missing values on classification error for discrete data. Pattern Recogn. 41, 3692–3705 (2008)
  14. Farhangfar, A., Kurgan, L.A., Pedrycz, W.: A novel framework for imputation of missing values in databases. IEEE Trans. Syst. Man Cybern. Part A: Syst. Hum. 37, 692–709 (2007)
  15. García-Laencina, P.J., Sancho-Gómez, J.L., Figueiras-Vidal, A.R.: Pattern classification with missing data: a review. Neural Comput. Appl. 19, 263–282 (2010)
  16. Graham, J.W.: Missing data analysis: making it work in the real world. Ann. Rev. Psychol. 60, 549–576 (2009)
  17. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newslett. 11, 10–18 (2009)
  18. Han, J., Kamber, M., Pei, J.: Data Mining, Southeast Asia Edition: Concepts and Techniques. Morgan Kaufmann, San Francisco (2006)
  19. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366 (1989)
  20. Keijzer, M.: Improving symbolic regression with interval arithmetic and linear scaling. In: Genetic programming, pp. 70–82 (2003)
  21. Kleinbaum, D., Kupper, L., Nizam, A., Rosenberg, E.: Applied regression analysis and other multivariable methods. Cengage Learning (2013)
  22. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge (1992)
  23. Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2, 18–22 (2002)
  24. Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. Wiley-Interscience, New York (2002)
  25. Luke, S., Panait, L., Balan, G., Paus, S., Skolicki, Z., Bassett, J., Hubley, R., Chircop, A.: ECJ: A Java-based evolutionary computation research system (2006). Downloadable versions and documentation can be found at http://cs.gmu.edu/eclab/projects/ecj
  26. Minka, T.: Bayesian linear regression. Technical report, 3594 Security Ticket Control (1999)
  27. Murphy, K.P.: Naive Bayes classifiers. University of British Columbia (2006)
  28. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco (1993)
  29. Schafer, J.L.: Analysis of Incomplete Multivariate Data. Monographs on Statistics & Applied Probability. Chapman & Hall/CRC, New York (1997)
  30. Schafer, J.L.: Analysis of Incomplete Multivariate Data. CRC Press, New York (1997)
  31. Silva, S., Dignum, S., Vanneschi, L.: Operator equalisation for bloat free genetic programming and a survey of bloat control methods. Genet. Program. Evolvable Mach. 13, 197–238 (2012)
  32. Topchy, A., Punch, W.F.: Faster genetic programming based on local gradient search of numeric leaf values. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), pp. 155–162 (2001)
  33. Tran, C.T., Zhang, M., Andreae, P.: Multiple imputation for missing data using genetic programming. In: Proceedings of the 2015 on Genetic and Evolutionary Computation Conference, pp. 583–590 (2015)
  34. Uy, N.Q., Hoai, N.X., O’Neill, M., Mckay, R.I., Galván-López, E.: Semantically-based crossover in genetic programming: application to real-valued symbolic regression. Genet. Program. Evolvable Mach. 12, 91–119 (2011)
  35. Van Buuren, S., Oudshoorn, C.: Multivariate imputation by chained equations. MICE V1. 0 user’s manual. Leiden: TNO Preventie en Gezondheid (2000)
  36. Van Buuren, S., Oudshoorn, K.: Flexible multivariate imputation by MICE. Technical report, PG/VGZ/99.054: TNO Prevention and Health, Leiden (1999)
  37. Vladislavleva, E.J., Smits, G.F., Den Hertog, D.: Order of nonlinearity as a complexity measure for models generated by symbolic regression via pareto genetic programming. IEEE Trans. Evol. Comput. 13, 333–349 (2009)
  38. White, I.R., Royston, P., Wood, A.M.: Multiple imputation using chained equations: issues and guidance for practice. Stat. Med. 30, 377–399 (2011)
Metadata
Copyright Year
2016
DOI
https://doi.org/10.1007/978-3-319-30668-1_10