Skip to main content

2018 | OriginalPaper | Buchkapitel

Dealing with Missing Data and Uncertainty in the Context of Data Mining

verfasst von : Aliya Aleryani, Wenjia Wang, Beatriz De La Iglesia

Erschienen in: Hybrid Artificial Intelligent Systems

Verlag: Springer International Publishing

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Missing data is an issue in many real-world datasets yet robust methods for dealing with missing data appropriately still need development. In this paper we conduct an investigation of how some methods for handling missing data perform when the uncertainty increases. Using benchmark datasets from the UCI Machine Learning repository we generate datasets for our experimentation with increasing amounts of data Missing Completely At Random (MCAR) both at the attribute level and at the record level. We then apply four classification algorithms: C4.5, Random Forest, Naïve Bayes and Support Vector Machines (SVMs). We measure the performance of each classifiers on the basis of complete case analysis, simple imputation and then we study the performance of the algorithms that can handle missing data. We find that complete case analysis has a detrimental effect because it renders many datasets infeasible when missing data increases, particularly for high dimensional data. We find that increasing missing data does have a negative effect on the performance of all the algorithms tested but the different algorithms tested either using preprocessing in the form of simple imputation or handling the missing data do not show a significant difference in performance.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
2.
Zurück zum Zitat Chai, X., Deng, L., Yang, Q., Ling, C.X.: Test-cost sensitive naive bayes classification. In: 2004 Fourth IEEE International Conference on Data Mining, ICDM 2004, pp. 51–58. IEEE (2004) Chai, X., Deng, L., Yang, Q., Ling, C.X.: Test-cost sensitive naive bayes classification. In: 2004 Fourth IEEE International Conference on Data Mining, ICDM 2004, pp. 51–58. IEEE (2004)
3.
Zurück zum Zitat Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7(Jan), 1–30 (2006)MathSciNetMATH Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7(Jan), 1–30 (2006)MathSciNetMATH
4.
Zurück zum Zitat Fichman, A., Cummings, J.N.: Multiple imputation for missing data: Making the most of what you know. Organ. Res. Meth. 6(3), 282–308 (2003)CrossRef Fichman, A., Cummings, J.N.: Multiple imputation for missing data: Making the most of what you know. Organ. Res. Meth. 6(3), 282–308 (2003)CrossRef
5.
Zurück zum Zitat García-Laencina, P.J., Sancho-Gómez, J.-L., Figueiras-Vidal, A.R.: Pattern classification with missing data: a review. Neural Comput. Appl. 19(2), 263–282 (2010)CrossRef García-Laencina, P.J., Sancho-Gómez, J.-L., Figueiras-Vidal, A.R.: Pattern classification with missing data: a review. Neural Comput. Appl. 19(2), 263–282 (2010)CrossRef
6.
Zurück zum Zitat Gavankar, S., Sawarkar, S.: Decision tree: Review of techniques for missing values at training, testing and compatibility. In: 2015 3rd International Conference on Artificial Intelligence, Modelling and Simulation (AIMS), pp. 122–126. IEEE (2015) Gavankar, S., Sawarkar, S.: Decision tree: Review of techniques for missing values at training, testing and compatibility. In: 2015 3rd International Conference on Artificial Intelligence, Modelling and Simulation (AIMS), pp. 122–126. IEEE (2015)
7.
Zurück zum Zitat George-Nektarios, T.: Weka classifiers summary. Athens University of Economics and Bussiness Intracom-Telecom, Athens (2013) George-Nektarios, T.: Weka classifiers summary. Athens University of Economics and Bussiness Intracom-Telecom, Athens (2013)
9.
Zurück zum Zitat Horton, N., Kleinman, K.P.: Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am. Stat. 61, 79–90 (2007)MathSciNetCrossRef Horton, N., Kleinman, K.P.: Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am. Stat. 61, 79–90 (2007)MathSciNetCrossRef
10.
Zurück zum Zitat Khalilia, M., Chakraborty, S., Popescu, M.: Predicting disease risks from highly imbalanced data using random forest. BMC Med. Inf. Decis. Making 11(1), 51 (2011)CrossRef Khalilia, M., Chakraborty, S., Popescu, M.: Predicting disease risks from highly imbalanced data using random forest. BMC Med. Inf. Decis. Making 11(1), 51 (2011)CrossRef
11.
Zurück zum Zitat Kohavi, R., Becker, B., Sommerfield, D.: Improving simple bayes. In: Proceedings of the European Conference on Machine Learning. Citeseer (1997) Kohavi, R., Becker, B., Sommerfield, D.: Improving simple bayes. In: Proceedings of the European Conference on Machine Learning. Citeseer (1997)
12.
Zurück zum Zitat Kotsiantis, S.B., Zaharakis, I., Pintelas, P.: Supervised machine learning: a review of classification techniques. Emerg. Artif. Intell. Appl. Comput. Eng. 160, 3–24 (2007) Kotsiantis, S.B., Zaharakis, I., Pintelas, P.: Supervised machine learning: a review of classification techniques. Emerg. Artif. Intell. Appl. Comput. Eng. 160, 3–24 (2007)
14.
Zurück zum Zitat Little, R.J.A., Rubin, D.B.: Statistical Analysis With Missing Data. Wiley, Hoboken (2014)MATH Little, R.J.A., Rubin, D.B.: Statistical Analysis With Missing Data. Wiley, Hoboken (2014)MATH
15.
Zurück zum Zitat Quinlan, J.R.: C4.5: Programs for Machine Learning. Elsevier, San Francisco (2014) Quinlan, J.R.: C4.5: Programs for Machine Learning. Elsevier, San Francisco (2014)
16.
Zurück zum Zitat Quinlan, J.R., et al.: Bagging, boosting, and c4. 5. In: The Association for the Advancement of Artificial Intelligence (AAAI), vol. 1, pp. 725–730 (1996) Quinlan, J.R., et al.: Bagging, boosting, and c4. 5. In: The Association for the Advancement of Artificial Intelligence (AAAI), vol. 1, pp. 725–730 (1996)
17.
Zurück zum Zitat Donald, B.: Rubin. Multiple imputation after 18+ years. J. Am. Stat. Assoc. 91(434), 473–489 (1996)CrossRef Donald, B.: Rubin. Multiple imputation after 18+ years. J. Am. Stat. Assoc. 91(434), 473–489 (1996)CrossRef
18.
Zurück zum Zitat Scheffer, J.: Dealing with missing data. Res. Lett. Inf. Math. Sci. 3(1), 153–160 (2002) Scheffer, J.: Dealing with missing data. Res. Lett. Inf. Math. Sci. 3(1), 153–160 (2002)
19.
Zurück zum Zitat Schölkopf, B., Burges, C.J.C., Smola, A.J.: Advances in Kernel Methods: Support Vector Learning. MIT press, Cambridge (1999)MATH Schölkopf, B., Burges, C.J.C., Smola, A.J.: Advances in Kernel Methods: Support Vector Learning. MIT press, Cambridge (1999)MATH
20.
Zurück zum Zitat Soley-Bori, M.: Dealing with missing data: Key assumptions and methods for applied analysis. Boston University School of Public Health (2013) Soley-Bori, M.: Dealing with missing data: Key assumptions and methods for applied analysis. Boston University School of Public Health (2013)
21.
Zurück zum Zitat Tabachnick, B.G., Fidell, L.S., Osterlind, S.J.: Using Multivariate Statistics. Allyn and Bacon, Boston (2001) Tabachnick, B.G., Fidell, L.S., Osterlind, S.J.: Using Multivariate Statistics. Allyn and Bacon, Boston (2001)
22.
23.
Zurück zum Zitat van der Heijden, G.J.M.G., Donders, A.R.T., Stijnen, T., Moons, K.G.M.: Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J. Clin. Epidemiol. 59(10), 1102–1109 (2006)CrossRef van der Heijden, G.J.M.G., Donders, A.R.T., Stijnen, T., Moons, K.G.M.: Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J. Clin. Epidemiol. 59(10), 1102–1109 (2006)CrossRef
24.
Zurück zum Zitat Witten, I.H., Frank, E., Hall, M.A., Pal, C.J.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Massachusetts (2016) Witten, I.H., Frank, E., Hall, M.A., Pal, C.J.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Massachusetts (2016)
25.
Zurück zum Zitat Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Angus, N., Liu, B., Philip, S.Y., et al.: Top 10 algorithms in data mining. Knowl. Inf. Syst. 14(1), 1–37 (2008)CrossRef Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Angus, N., Liu, B., Philip, S.Y., et al.: Top 10 algorithms in data mining. Knowl. Inf. Syst. 14(1), 1–37 (2008)CrossRef
Metadaten
Titel
Dealing with Missing Data and Uncertainty in the Context of Data Mining
verfasst von
Aliya Aleryani
Wenjia Wang
Beatriz De La Iglesia
Copyright-Jahr
2018
DOI
https://doi.org/10.1007/978-3-319-92639-1_24

Premium Partner