2020 | OriginalPaper | Chapter

Missing Features Reconstruction Using a Wasserstein Generative Adversarial Imputation Network

Authors: Magda Friedjungová, Daniel Vašata, Maksym Balatsko, Marcel Jiřina

Published in: Computational Science – ICCS 2020

Publisher: Springer International Publishing

Abstract

Missing data is one of the most common preprocessing problems. In this paper, we experimentally investigate the use of generative and non-generative models for feature reconstruction. Variational Autoencoder with Arbitrary Conditioning (VAEAC) and Generative Adversarial Imputation Network (GAIN) were studied as representatives of generative models, while the denoising autoencoder (DAE) represented non-generative models. The performance of these models is compared to that of the traditional methods k-nearest neighbors (k-NN) and Multiple Imputation by Chained Equations (MICE). Moreover, we introduce WGAIN, a Wasserstein modification of GAIN, which turns out to be the best imputation model when the degree of missingness is less than or equal to \(30\%\). Experiments were performed on real-world and artificial datasets with continuous features, where different percentages of features, varying from \(10\%\) to \(50\%\), were missing. The algorithms were evaluated by measuring the accuracy of a classification model previously trained on the uncorrupted dataset. The results show that GAIN, and especially WGAIN, are the best imputers regardless of the conditions. In general, they outperform or are comparable to MICE, k-NN, DAE, and VAEAC.
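The evaluation protocol described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: scikit-learn's KNNImputer stands in for the k-NN imputer, a random forest stands in for the classifier, and the masking scheme (values removed completely at random) is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# 1. Train a classifier on the uncorrupted dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
baseline = clf.score(X_test, y_test)

# 2. Corrupt the test set: remove 30% of feature values at random.
X_missing = X_test.copy()
mask = rng.random(X_missing.shape) < 0.30
X_missing[mask] = np.nan

# 3. Impute the missing values and re-score with the *same*
#    pretrained classifier; the accuracy drop measures imputer quality.
imputer = KNNImputer(n_neighbors=5).fit(X_train)
X_imputed = imputer.transform(X_missing)
imputed_acc = clf.score(X_imputed, y_test)

print(f"baseline accuracy: {baseline:.3f}, after imputation: {imputed_acc:.3f}")
```

The same loop, repeated over missingness levels from 10% to 50% and over each imputer (MICE, DAE, VAEAC, GAIN, WGAIN), yields the comparison reported in the paper.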
Metadata
Copyright Year
2020
DOI
https://doi.org/10.1007/978-3-030-50423-6_17