2018 | OriginalPaper | Book chapter

Review on General Techniques and Packages for Data Imputation in R on a Real World Dataset

Authors: Fitore Muharemi, Doina Logofătu, Florin Leon

Published in: Computational Collective Intelligence

Publisher: Springer International Publishing

Abstract

Collected data often consist of small samples with missing values, which makes data analysis less effective. Almost all algorithms for statistical data analysis need a complete dataset, so missing values have to be handled during data preprocessing. Well-known methods for filling in missing values include the mean, K-nearest neighbours (kNN), and fuzzy K-means (FKM). Many R packages offer imputation of missing values, but it is sometimes hard to find the appropriate algorithm for a particular dataset. On large datasets, these known methods may also fail to work as intended because they need too much memory to perform their operations. This paper gives an overview of imputing a sizeable dataset by applying three different algorithms. The algorithms are compared under a missing completely at random (MCAR) assumption, using root mean squared error (RMSE) as the evaluation criterion. The experimental results show that the Random Forest algorithm can be quite useful for imputing missing values.
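The evaluation setup named in the abstract (mask values under MCAR, impute them, score with RMSE) can be sketched as follows. This is a minimal illustration in Python rather than R, and it is not the authors' code. The toy data, the function names `mask_mcar`, `impute_column_mean` and `rmse_on_missing`, and the choice of column-mean imputation as the method under test are all assumptions made for the example.

```python
import math
import random


def mask_mcar(rows, rate, seed=0):
    """Copy rows, setting each cell to None independently with probability `rate`.

    Every cell has the same chance of going missing regardless of its value,
    which is what the MCAR assumption means.
    """
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def impute_column_mean(rows):
    """Fill each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Assumption: fall back to 0.0 if a column has no observed values.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def rmse_on_missing(truth, masked, imputed):
    """Root mean squared error over the cells that were masked out."""
    squared_errors = [
        (t - p) ** 2
        for t_row, m_row, p_row in zip(truth, masked, imputed)
        for t, m, p in zip(t_row, m_row, p_row)
        if m is None
    ]
    if not squared_errors:
        return 0.0
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical toy data: 200 rows, two numeric columns.
data_rng = random.Random(42)
truth = [[data_rng.gauss(10.0, 2.0), data_rng.gauss(50.0, 5.0)] for _ in range(200)]

masked = mask_mcar(truth, rate=0.2, seed=1)
imputed = impute_column_mean(masked)
score = rmse_on_missing(truth, masked, imputed)
print(f"RMSE of mean imputation on masked cells: {score:.3f}")
```

Scoring only the masked cells keeps the metric focused on imputation quality. Comparing several methods then amounts to swapping `impute_column_mean` for another imputer while reusing the same mask.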

Metadata
Title
Review on General Techniques and Packages for Data Imputation in R on a Real World Dataset
Authors
Fitore Muharemi
Doina Logofătu
Florin Leon
Copyright year
2018
DOI
https://doi.org/10.1007/978-3-319-98446-9_36
