2019 | OriginalPaper | Book chapter

Feature Based Multivariate Data Imputation

Written by: Alessio Petrozziello, Ivan Jordanov

Published in: Machine Learning, Optimization, and Data Science

Publisher: Springer International Publishing

Abstract

We investigate a new multivariate data imputation approach for dealing with a variety of types of missingness. The proposed approach aggregates the most suitable methods from a multitude of imputation techniques, selected for each feature of the dataset. We report results from a comparison with two single imputation techniques (Random Guessing and Median Imputation) and four state-of-the-art multivariate methods (K-Nearest Neighbour Imputation, Bagged Tree Imputation, Multivariate Imputation by Chained Equations, and Bayesian Principal Component Analysis Imputation) on several public-domain datasets, demonstrating favorable performance for our model. The proposed method, Feature Guided Data Imputation, is compared with the other tested methods in three experimental settings: Missing Completely at Random, Missing at Random, and Missing Not at Random, with 25% missing data in the test set over five-fold cross-validation. Furthermore, the proposed model has a straightforward implementation and can easily incorporate other imputation techniques.
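The feature-wise selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the candidate pool (median, mean, random guessing), the selection criterion (mean absolute error on values hidden completely at random), and all function names are assumptions made for illustration. Only the 25% masking rate and the principle of picking the best technique per feature come from the abstract.

```python
import random
import statistics

# Hypothetical candidate imputers; each fills None entries of one column.
def impute_median(column, rng):
    med = statistics.median(v for v in column if v is not None)
    return [med if v is None else v for v in column]

def impute_mean(column, rng):
    mean = statistics.fmean(v for v in column if v is not None)
    return [mean if v is None else v for v in column]

def impute_random(column, rng):
    observed = [v for v in column if v is not None]
    return [rng.choice(observed) if v is None else v for v in column]

CANDIDATES = {"median": impute_median, "mean": impute_mean, "random": impute_random}

def mask_mcar(column, rate, rng):
    """Hide a fraction of the observed entries completely at random."""
    observed_idx = [i for i, v in enumerate(column) if v is not None]
    n_hidden = max(1, int(rate * len(observed_idx)))
    hidden = set(rng.sample(observed_idx, n_hidden))
    masked = [None if i in hidden else v for i, v in enumerate(column)]
    return masked, sorted(hidden)

def select_per_feature(columns, rate=0.25, seed=0):
    """For each feature, pick the candidate with the lowest error on hidden values."""
    rng = random.Random(seed)
    chosen = {}
    for name, column in columns.items():
        masked, hidden = mask_mcar(column, rate, rng)
        errors = {}
        for method, imputer in CANDIDATES.items():
            filled = imputer(masked, rng)
            # Assumed criterion: mean absolute error on the artificially hidden cells.
            errors[method] = statistics.fmean(abs(filled[i] - column[i]) for i in hidden)
        chosen[name] = min(errors, key=errors.get)
    return chosen

def impute_dataset(columns, chosen, seed=0):
    """Fill each feature with the technique selected for it."""
    rng = random.Random(seed)
    return {name: CANDIDATES[chosen[name]](col, rng) for name, col in columns.items()}

# Toy data (invented for this example); None marks a missing value.
data = {
    "age": [23.0, 31.0, None, 45.0, 38.0, 29.0, None, 52.0],
    "income": [1800.0, 2500.0, 2100.0, None, 9000.0, 2300.0, 2000.0, 2200.0],
}
chosen = select_per_feature(data)
completed = impute_dataset(data, chosen)
print(chosen)
```

Because `CANDIDATES` is a plain dictionary, adding another technique only requires registering one more function with the same signature, which mirrors the abstract's remark that other imputation techniques can easily be incorporated. A fuller evaluation would repeat the selection over cross-validation folds and over the Missing at Random and Missing Not at Random settings, which this sketch does not attempt.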

Metadata
Title
Feature Based Multivariate Data Imputation
Written by
Alessio Petrozziello
Ivan Jordanov
Copyright year
2019
DOI
https://doi.org/10.1007/978-3-030-13709-0_3