
2018 | OriginalPaper | Book chapter

Dealing with Missing Data and Uncertainty in the Context of Data Mining

Authors: Aliya Aleryani, Wenjia Wang, Beatriz De La Iglesia

Published in: Hybrid Artificial Intelligent Systems

Publisher: Springer International Publishing

Abstract

Missing data is an issue in many real-world datasets, yet robust methods for handling it appropriately still need development. In this paper we investigate how several methods for handling missing data perform as the uncertainty increases. Using benchmark datasets from the UCI Machine Learning Repository, we generate experimental datasets with increasing amounts of data Missing Completely At Random (MCAR), both at the attribute level and at the record level. We then apply four classification algorithms: C4.5, Random Forest, Naïve Bayes and Support Vector Machines (SVMs). We measure the performance of each classifier under complete case analysis and under simple imputation, and then study the performance of the algorithms that can handle missing data themselves. We find that complete case analysis has a detrimental effect because it renders many datasets infeasible as missing data increases, particularly for high-dimensional data. We also find that increasing missing data has a negative effect on the performance of all the algorithms tested, but that the algorithms, whether they rely on preprocessing in the form of simple imputation or handle the missing data themselves, do not differ significantly in performance.
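The experimental setup described above can be illustrated with a minimal Python sketch. It is not the authors' code: the abstract does not say how missingness was injected, so the cell-wise scheme below and the helper names `inject_mcar`, `complete_cases` and `mean_impute` are assumptions made for illustration. The sketch removes values completely at random, then contrasts complete case analysis with mean imputation as one form of simple imputation.

```python
import random
from statistics import mean


def inject_mcar(rows, rate, seed=0):
    """Return a copy of rows in which each cell is independently replaced
    by None with probability `rate` (cell-wise MCAR, an assumed scheme)."""
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def complete_cases(rows):
    """Complete case analysis: keep only records with no missing value."""
    return [row for row in rows if all(v is not None for v in row)]


def mean_impute(rows):
    """Simple imputation: fill each missing cell with the mean of the
    observed values in its column. A column with no observed values
    is left as None because there is nothing to estimate from."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed) if observed else None)
    return [
        [col_means[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]


if __name__ == "__main__":
    rng = random.Random(42)
    n_records = 1000
    for n_features in (5, 50):
        # Synthetic numeric data standing in for a benchmark dataset.
        data = [
            [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for _ in range(n_records)
        ]
        for rate in (0.05, 0.2):
            damaged = inject_mcar(data, rate, seed=1)
            kept = len(complete_cases(damaged))
            imputed = mean_impute(damaged)
            print(
                f"features={n_features} rate={rate} "
                f"complete records={kept}/{n_records} "
                f"records after imputation={len(imputed)}"
            )
```

Under this cell-wise assumption, a record with d attributes survives complete case analysis with probability (1 - p)^d at missing rate p. At p = 0.2 and d = 50 that is roughly 1 in 70,000 records, which is one way to see why complete case analysis becomes infeasible for high-dimensional data while imputation keeps every record.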

Title: Dealing with Missing Data and Uncertainty in the Context of Data Mining
Authors: Aliya Aleryani, Wenjia Wang, Beatriz De La Iglesia
Publisher: Springer International Publishing
Book: Hybrid Artificial Intelligent Systems
Print ISBN: 978-3-319-92638-4
Electronic ISBN: 978-3-319-92639-1
Copyright year: 2018
DOI: https://doi.org/10.1007/978-3-319-92639-1_24
