Published in: Data Mining and Knowledge Discovery 6/2020

25 July 2020

MIDIA: exploring denoising autoencoders for missing data imputation

Written by: Qian Ma, Wang-Chien Lee, Tao-Yang Fu, Yu Gu, Ge Yu


Abstract

Missing values (MVs) are ubiquitous in real-world datasets, so MV imputation, which aims to recover them, is a fundamental data preprocessing step that many data analytics and mining tasks depend on to perform well. A typical way to impute MVs is to exploit the correlations among the attributes of the data. However, those correlations are usually complex and thus difficult to identify. Accordingly, we develop a new deep learning model called MIssing Data Imputation denoising Autoencoder (MIDIA) that imputes the MVs in a given dataset by exploring non-linear correlations between missing and non-missing values. Additionally, considering various data missing patterns, we propose two MV imputation approaches based on the MIDIA model, namely MIDIA-Sequential and MIDIA-Batch. MIDIA-Sequential imputes the MVs attribute by attribute, training an independent MIDIA model for each incomplete attribute. By contrast, MIDIA-Batch imputes all MVs in one batch by training a single unified MIDIA model. Finally, we evaluate the proposed approaches experimentally against existing MV imputation algorithms. The experimental results demonstrate that both MIDIA-Sequential and MIDIA-Batch achieve significantly higher imputation accuracy than existing solutions, and that the proposed approaches can handle various data missing patterns and data types. Specifically, MIDIA-Sequential performs better than MIDIA-Batch on data with a monotone missing pattern, while MIDIA-Batch performs better than MIDIA-Sequential on data with a general missing pattern.
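
The general idea of imputing with a denoising autoencoder can be illustrated with a small sketch. The code below is not the MIDIA model: the abstract gives no architecture or training details, so everything here is an assumption made for illustration, including the function name `dae_impute`, the single tanh hidden layer, the mean-based initial fill, the corruption rate, and the choice to compute the loss only on observed cells. It trains a tiny autoencoder to reconstruct each record from a randomly corrupted copy, then uses the reconstruction to fill the missing cells.

```python
import math
import random


def dae_impute(rows, hidden=4, epochs=300, lr=0.05, drop=0.25, seed=0):
    """Impute None entries in `rows` (values scaled to [0, 1]) with a tiny DAE.

    Illustrative sketch only; all hyperparameters are assumptions.
    """
    rng = random.Random(seed)
    d = len(rows[0])
    # Column means of the observed values: used as the initial fill and as
    # the replacement value when an input cell is corrupted.
    means = []
    for j in range(d):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    filled = [[means[j] if r[j] is None else r[j] for j in range(d)] for r in rows]

    # One tanh hidden layer and a linear output layer.
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(d)]
    b2 = [0.0] * d

    def forward(x):
        h = [math.tanh(sum(w * v for w, v in zip(w1[k], x)) + b1[k])
             for k in range(hidden)]
        y = [sum(w * v for w, v in zip(w2[j], h)) + b2[j] for j in range(d)]
        return h, y

    for _ in range(epochs):
        for raw, x in zip(rows, filled):
            # Corruption step: hide each input cell with probability `drop`.
            xc = [means[j] if rng.random() < drop else x[j] for j in range(d)]
            h, y = forward(xc)
            # Squared-error gradient, restricted to originally observed cells.
            err = [(y[j] - x[j]) if raw[j] is not None else 0.0 for j in range(d)]
            # Backpropagate through the output layer before updating it.
            dh = [(1.0 - h[k] ** 2) * sum(err[j] * w2[j][k] for j in range(d))
                  for k in range(hidden)]
            for j in range(d):
                for k in range(hidden):
                    w2[j][k] -= lr * err[j] * h[k]
                b2[j] -= lr * err[j]
            for k in range(hidden):
                for j in range(d):
                    w1[k][j] -= lr * dh[k] * xc[j]
                b1[k] -= lr * dh[k]

    # Keep observed values; replace only the missing ones with the output.
    result = []
    for raw, x in zip(rows, filled):
        _, y = forward(x)
        result.append([y[j] if raw[j] is None else raw[j] for j in range(d)])
    return result


# Toy data with a non-linear dependency between attributes (hypothetical).
gen = random.Random(1)
truth = [[a, a * a, 1.0 - a] for a in (gen.random() for _ in range(40))]
data = [row[:] for row in truth]
for i, j in [(3, 1), (10, 2), (25, 0)]:
    data[i][j] = None  # knock out a few cells

imputed = dae_impute(data)
for i, j in [(3, 1), (10, 2), (25, 0)]:
    print(f"row {i}, attribute {j}: imputed {imputed[i][j]:.3f}, true {truth[i][j]:.3f}")
```

In this sketch a single network reconstructs all attributes at once, which is loosely in the spirit of a batch-style approach; a sequential variant would instead train one model per incomplete attribute. How closely either matches MIDIA-Batch or MIDIA-Sequential cannot be judged from the abstract alone.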


Metadata
Title
MIDIA: exploring denoising autoencoders for missing data imputation
Written by
Qian Ma
Wang-Chien Lee
Tao-Yang Fu
Yu Gu
Ge Yu
Publication date
25 July 2020
Publisher
Springer US
Published in
Data Mining and Knowledge Discovery / Issue 6/2020
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-020-00706-8
