
2018 | OriginalPaper | Chapter

Review on General Techniques and Packages for Data Imputation in R on a Real World Dataset

Authors: Fitore Muharemi, Doina Logofătu, Florin Leon

Published in: Computational Collective Intelligence

Publisher: Springer International Publishing


Abstract

Collected data often consist of small samples with missing values, which makes data analysis less effective. Almost all algorithms for statistical data analysis need a complete data set, so missing values have to be handled during data preprocessing. Well-known methods for filling in missing values include mean imputation, k-nearest neighbours (kNN), and fuzzy k-means (FKM). Many R packages offer imputation of missing values, but it is sometimes hard to find the appropriate algorithm for a particular dataset. With large datasets, these methods sometimes do not work as intended because they need too much memory to perform their operations. This paper gives an overview of imputing a sizeable dataset with three different algorithms. The algorithms were compared under a missing completely at random (MCAR) assumption, using root mean squared error (RMSE) as the evaluation criterion. The experimental results show that the Random Forest algorithm can be quite useful for missing value imputation.
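
The evaluation idea named in the abstract (hide values under MCAR, impute them, score the result with RMSE) can be illustrated with a minimal sketch. It is not the authors' code: it uses Python's standard library rather than R, a synthetic series in place of the real dataset, and mean imputation as the simplest of the methods mentioned. The function names `mask_mcar`, `impute_mean`, and `rmse_on_missing` are hypothetical, and scoring only the masked positions is an assumption about how the RMSE would be computed.

```python
import math
import random
import statistics


def mask_mcar(values, missing_rate, seed=0):
    """Copy `values`, setting each entry to None independently with
    probability `missing_rate` (missing completely at random)."""
    rng = random.Random(seed)
    return [None if rng.random() < missing_rate else v for v in values]


def impute_mean(values):
    """Replace every None with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


def rmse_on_missing(truth, masked, imputed):
    """RMSE between true and imputed values, restricted to the positions
    that were masked (assumption: observed entries are not scored)."""
    errors = [(t - i) ** 2 for t, m, i in zip(truth, masked, imputed) if m is None]
    if not errors:
        return 0.0
    return math.sqrt(sum(errors) / len(errors))


# Synthetic series standing in for one numeric column of a real dataset.
rng = random.Random(42)
truth = [20.0 + rng.gauss(0.0, 2.0) for _ in range(1000)]

masked = mask_mcar(truth, missing_rate=0.1, seed=1)
imputed = impute_mean(masked)
score = rmse_on_missing(truth, masked, imputed)
print(f"RMSE of mean imputation on masked entries: {score:.3f}")
```

Swapping `impute_mean` for another imputer with the same input and output shape would allow several methods to be compared on the same mask, which is the kind of comparison the abstract describes.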



Title: Review on General Techniques and Packages for Data Imputation in R on a Real World Dataset
Authors: Fitore Muharemi, Doina Logofătu, Florin Leon
Publisher: Springer International Publishing
Book: Computational Collective Intelligence
Print ISBN: 978-3-319-98445-2
Electronic ISBN: 978-3-319-98446-9
Copyright Year: 2018
DOI: https://doi.org/10.1007/978-3-319-98446-9_36
