
Data Mining and Knowledge Discovery

Published: 25 July 2020

MIDIA: exploring denoising autoencoders for missing data imputation

Authors: Qian Ma, Wang-Chien Lee, Tao-Yang Fu, Yu Gu, Ge Yu

Published in: Data Mining and Knowledge Discovery | Issue 6/2020


Abstract

Due to the ubiquitous presence of missing values (MVs) in real-world datasets, MV imputation, which aims to recover MVs, is an important and fundamental data preprocessing step that many data analytics and mining tasks depend on to perform well. A typical approach to imputing MVs is to exploit the correlations among the attributes of the data. However, those correlations are usually complex and thus difficult to identify. Accordingly, we develop a new deep learning model called MIssing Data Imputation denoising Autoencoder (MIDIA) that imputes the MVs in a given dataset by exploring non-linear correlations between missing and non-missing values. Additionally, to account for different missing-data patterns, we propose two MV imputation approaches based on the MIDIA model, namely MIDIA-Sequential and MIDIA-Batch. MIDIA-Sequential imputes the MVs attribute by attribute, training an independent MIDIA model for each incomplete attribute. By contrast, MIDIA-Batch imputes all MVs in one batch by training a single, uniform MIDIA model. Finally, we evaluate the proposed approaches experimentally against existing MV imputation algorithms. The experimental results demonstrate that both MIDIA-Sequential and MIDIA-Batch achieve significantly higher imputation accuracy than existing solutions, and that the proposed approaches can handle various missing-data patterns and data types. Specifically, MIDIA-Sequential performs better than MIDIA-Batch on data with a monotone missing pattern, while MIDIA-Batch performs better than MIDIA-Sequential on data with a general missing pattern.
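To make the general idea concrete, the following is a minimal sketch of imputation with a denoising autoencoder, written in plain Python. It is not the authors' MIDIA architecture, which the abstract does not specify. The function names (`train_dae`, `impute`), the single hidden layer, the sigmoid activations, the zero-fill corruption, the hyperparameters, and the toy data are all assumptions made for illustration. The sketch trains one network on all attributes at once, corrupts inputs by zeroing random cells, and measures reconstruction error only on observed cells.

```python
import math
import random


def _sigmoid(z):
    # Clamp the input so math.exp never overflows.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def _forward(x, W1, b1, W2, b2):
    """One hidden encoder layer and a mirrored one-layer decoder."""
    h = [_sigmoid(sum(w * v for w, v in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = [_sigmoid(sum(w * v for w, v in zip(row, h)) + b) for row, b in zip(W2, b2)]
    return h, y


def train_dae(rows, hidden=4, epochs=300, lr=0.5, drop=0.25, seed=0):
    """Train on rows scaled to [0, 1]; None marks a missing value (illustrative only)."""
    rng = random.Random(seed)
    d = len(rows[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(hidden)]
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(d)]
    b1, b2 = [0.0] * hidden, [0.0] * d
    for _ in range(epochs):
        for row in rows:
            # Corrupt the input: missing cells and a random subset of
            # observed cells are set to 0.
            xc = [0.0 if v is None or rng.random() < drop else v for v in row]
            h, y = _forward(xc, W1, b1, W2, b2)
            # Squared-error gradient at the output, on observed cells only.
            dy = [0.0 if v is None else (y[j] - v) * y[j] * (1.0 - y[j])
                  for j, v in enumerate(row)]
            # Backpropagate to the hidden layer before W2 is updated.
            dh = [sum(dy[j] * W2[j][k] for j in range(d)) * h[k] * (1.0 - h[k])
                  for k in range(hidden)]
            for j in range(d):
                for k in range(hidden):
                    W2[j][k] -= lr * dy[j] * h[k]
                b2[j] -= lr * dy[j]
            for k in range(hidden):
                for j in range(d):
                    W1[k][j] -= lr * dh[k] * xc[j]
                b1[k] -= lr * dh[k]
    return W1, b1, W2, b2


def impute(rows, params):
    """Replace each None with the network's reconstruction; keep observed values."""
    completed = []
    for row in rows:
        x = [0.0 if v is None else v for v in row]
        _, y = _forward(x, *params)
        completed.append([y[j] if v is None else v for j, v in enumerate(row)])
    return completed


# Toy data, assumed to be min-max scaled already; the third attribute
# roughly tracks the first, the second roughly mirrors it.
data = [
    [0.1, 0.9, 0.2], [0.2, 0.8, 0.3], [0.8, 0.2, 0.9], [0.9, 0.1, 0.8],
    [0.15, 0.85, None], [0.85, None, 0.9], [None, 0.15, 0.85],
]
params = train_dae(data)
completed = impute(data, params)
for row in completed:
    print([round(v, 3) for v in row])
```

Training a single network over all attributes is loosely in the spirit of the batch variant described above. A per-attribute variant would instead train one such model for each incomplete attribute and fill the attributes in turn, as the abstract describes for MIDIA-Sequential. How the paper orders the attributes or structures its networks is not stated here.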


https://archive.ics.uci.edu/ml/datasets/Air+Quality.

http://archive.ics.uci.edu/ml/datasets/Adult.

http://archive.ics.uci.edu/ml/datasets/Car+Evaluation.

We only consider the number of hidden layers in the encoder, since the decoder is symmetric with the encoder.


Title: MIDIA: exploring denoising autoencoders for missing data imputation
Authors: Qian Ma, Wang-Chien Lee, Tao-Yang Fu, Yu Gu, Ge Yu
Publication date: 25 July 2020
Publisher: Springer US
Published in: Data Mining and Knowledge Discovery / Issue 6/2020
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI: https://doi.org/10.1007/s10618-020-00706-8
