2018 | OriginalPaper | Book chapter

Scalable Model-Based Cascaded Imputation of Missing Data

Written by: Jacob Montiel, Jesse Read, Albert Bifet, Talel Abdessalem

Published in: Advances in Knowledge Discovery and Data Mining

Publisher: Springer International Publishing

Abstract

Missing data is a common trait of real-world data that can negatively impact interpretability. In this paper, we present Cascade Imputation (CIM), an effective and scalable technique for automatic imputation of missing data. CIM places few restrictions on the characteristics of the data set: it supports Missing At Random and Missing Completely At Random data, numerical and nominal attributes, and large data sets, including high-dimensional ones. We compare CIM against well-established imputation techniques over a variety of data sets under multiple test configurations to measure the impact of imputation on the classification problem. Test results show that CIM outperforms other imputation methods over multiple test conditions. Additionally, we identify optimal performance and failure conditions for popular imputation techniques.

Metadata
Title
Scalable Model-Based Cascaded Imputation of Missing Data
Written by
Jacob Montiel
Jesse Read
Albert Bifet
Talel Abdessalem
Copyright year
2018
DOI
https://doi.org/10.1007/978-3-319-93040-4_6