2020 | OriginalPaper | Book chapter

Impact of Dimension and Sample Size on the Performance of Imputation Methods

Written by: Yanjun Cui, Junhu Wang

Published in: Data Science

Publisher: Springer Singapore

Abstract

Real-world data collections often contain missing values, which can cause serious problems for data analysis. Simply discarding records with missing values tends to bias the analysis. Missing data imputation methods try to fill in the missing values with estimated values. While numerous imputation methods have been proposed, they are mostly judged by their imputation accuracy, and little attention has been paid to their efficiency. As data collections grow, imputation efficiency becomes an important issue. In this work we conduct an experimental comparison of several popular imputation methods, focusing on their time efficiency and scalability in terms of sample size and record dimension (number of attributes). We believe these results can guide data analysts in choosing imputation methods.
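A timing comparison of this kind could be set up roughly as follows. This is a minimal sketch, not the authors' experimental setup: the two methods shown (column-mean imputation and a naive k-nearest-neighbour imputation), the synthetic Gaussian data, the 10% missing rate, the grid of sizes, and all function names are assumptions chosen for illustration.

```python
import math
import random
import time


def make_data(n_rows, n_cols, missing_rate=0.1, seed=0):
    """Generate a random numeric table and blank out some cells with None."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]
    for row in data:
        for j in range(n_cols):
            if rng.random() < missing_rate:
                row[j] = None
    return data


def column_means(data):
    """Mean of the observed values in each column (0.0 if a column is empty)."""
    means = []
    for j in range(len(data[0])):
        observed = [row[j] for row in data if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means


def mean_impute(data):
    """Replace each missing cell with the mean of its column."""
    means = column_means(data)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in data]


def knn_impute(data, k=3):
    """Fill each missing cell with the column mean over the k nearest rows.

    Distance is the root mean squared difference over the attributes that are
    observed in both rows. Falls back to the column mean when no donor exists.
    """
    fallback = column_means(data)
    result = [list(row) for row in data]
    for i, row in enumerate(data):
        missing = [j for j, v in enumerate(row) if v is None]
        if not missing:
            continue
        # Brute-force scan of all other rows: the cost grows with rows x columns.
        dists = []
        for m, other in enumerate(data):
            if m == i:
                continue
            shared = [(a - b) ** 2 for a, b in zip(row, other)
                      if a is not None and b is not None]
            if shared:
                dists.append((math.sqrt(sum(shared) / len(shared)), m))
        dists.sort()
        for j in missing:
            donors = [data[m][j] for _, m in dists if data[m][j] is not None][:k]
            result[i][j] = sum(donors) / len(donors) if donors else fallback[j]
    return result


def time_method(method, data):
    """Return the wall-clock seconds taken by one imputation run and its result."""
    start = time.perf_counter()
    filled = method(data)
    return time.perf_counter() - start, filled


if __name__ == "__main__":
    # Vary sample size and dimension separately to see how each method scales.
    for n_rows, n_cols in [(100, 5), (200, 5), (100, 20)]:
        data = make_data(n_rows, n_cols)
        for name, method in [("mean", mean_impute), ("kNN", knn_impute)]:
            elapsed, _ = time_method(method, data)
            print(f"{name:>4}  rows={n_rows:<4} cols={n_cols:<3} {elapsed:.4f}s")
```

Changing the number of rows and the number of columns independently, as in the loop above, separates the effect of sample size from that of dimension. Absolute timings depend on the machine and implementation, so only the relative growth is meaningful.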

Metadata
Title
Impact of Dimension and Sample Size on the Performance of Imputation Methods
Written by
Yanjun Cui
Junhu Wang
Copyright year
2020
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-15-2810-1_51