2018 | OriginalPaper | Book chapter

Scalable Model-Based Cascaded Imputation of Missing Data

Authors: Jacob Montiel, Jesse Read, Albert Bifet, Talel Abdessalem

Published in: Advances in Knowledge Discovery and Data Mining

Publisher: Springer International Publishing

Abstract

Missing data is a common trait of real-world data that can negatively impact interpretability. In this paper, we present Cascade Imputation (CIM), an effective and scalable technique for automatic imputation of missing data. CIM is not restrictive on the characteristics of the data set, providing support for: Missing At Random and Missing Completely At Random data, numerical and nominal attributes, and large data sets including highly dimensional data sets. We compare CIM against well-established imputation techniques over a variety of data sets under multiple test configurations to measure the impact of imputation on the classification problem. Test results show that CIM outperforms other imputation methods over multiple test conditions. Additionally, we identify optimal performance and failure conditions for popular imputation techniques.
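
The abstract does not spell out the algorithm, so the following is only a minimal sketch of what a generic model-based cascade could look like, not the authors' CIM implementation. It assumes that attributes are processed from fewest to most missing values, that each attribute is predicted from the attributes already completed, and that a small k-nearest-neighbour predictor stands in for the per-attribute model. The names `cascade_impute`, `_knn_predict`, `_distance`, and the parameter `k` are hypothetical and chosen for illustration only.

```python
from collections import Counter
from math import sqrt

MISSING = None  # assumption: missing cells are encoded as None


def _distance(a, b):
    """Mixed-type distance: squared difference for numbers, 0/1 mismatch for nominal values."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += (x - y) ** 2
        else:
            total += 0.0 if x == y else 1.0
    return sqrt(total)


def _knn_predict(train_X, train_y, query, k):
    """Stand-in model: mean of the k nearest targets (numeric) or their mode (nominal)."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: _distance(p[0], query))[:k]
    values = [y for _, y in nearest]
    if all(isinstance(v, (int, float)) for v in values):
        return sum(values) / len(values)
    return Counter(values).most_common(1)[0][0]


def cascade_impute(rows, k=3):
    """Impute column by column; each completed column becomes a predictor for the next."""
    data = [list(r) for r in rows]  # work on a copy, leave the input untouched
    n_cols = len(data[0])
    missing_count = [sum(r[j] is MISSING for r in data) for j in range(n_cols)]
    # Assumed ordering: columns with the fewest missing values come first.
    order = sorted(range(n_cols), key=lambda j: missing_count[j])
    complete = [j for j in order if missing_count[j] == 0]
    if not complete:
        raise ValueError("this sketch needs at least one fully observed column")
    for j in order:
        if missing_count[j] == 0:
            continue
        train = [r for r in data if r[j] is not MISSING]
        if not train:
            raise ValueError(f"column {j} has no observed values to learn from")
        train_X = [[r[c] for c in complete] for r in train]
        train_y = [r[j] for r in train]
        for r in data:
            if r[j] is MISSING:
                r[j] = _knn_predict(train_X, train_y, [r[c] for c in complete], k)
        complete.append(j)  # the imputed column now feeds the next stage of the cascade
    return data


# Toy data: one numerical column, one nominal column, one numerical column.
rows = [
    [1.0, "a", 10.0],
    [2.0, "a", 20.0],
    [3.0, "b", None],
    [4.0, None, 40.0],
    [5.0, "b", 50.0],
]
result = cascade_impute(rows, k=3)
```

In this sketch the nominal column is filled first from the fully observed numerical column, and the imputed nominal values then help predict the remaining numerical gap. Any regressor or classifier could replace the k-nearest-neighbour stand-in without changing the cascade structure.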

Print ISBN: 978-3-319-93039-8
Electronic ISBN: 978-3-319-93040-4
Copyright year: 2018
DOI: https://doi.org/10.1007/978-3-319-93040-4_6
