
Applied Soft Computing

Volume 29, April 2015, Pages 65-74

Single imputation with multilayer perceptron and multiple imputation combining multilayer perceptron and k-nearest neighbours for monotone patterns

https://doi.org/10.1016/j.asoc.2014.09.052

Highlights

  • Imputation of data for monotone patterns of missing values.

  • An estimation model of missing data based on multilayer perceptron.

  • Multiple imputation combining a neural network and k-nearest neighbours.

  • Comparison of the performance of proposed models with three classic procedures.

  • Three classic single imputation models: mean/mode, regression and hot-deck.

Abstract

The knowledge discovery process relies on the information gathered from collected data sets, which often contain errors in the form of missing values. Data imputation is the activity aimed at estimating values for missing data items. This study focuses on the development of automated data imputation models based on artificial neural networks for monotone patterns of missing values. The present work proposes a single imputation approach relying on a multilayer perceptron whose training is conducted with different learning rules, and a multiple imputation approach based on the combination of the multilayer perceptron and k-nearest neighbours. Eighteen real and simulated databases were exposed to a perturbation experiment with random generation of monotone missing data patterns. An empirical test was carried out on these data sets with both approaches (single and multiple imputation), and three classical single imputation procedures – mean/mode imputation, regression and hot-deck – were also considered, so the experiments involved five imputation methods in total. The results, considering different performance measures, demonstrate that, in comparison with traditional tools, both proposals improve the automation level and data quality, offering a satisfactory performance.

Introduction

Computer assisted personal interview (CAPI), computer assisted telephone interview (CATI) and web assisted personal interview (WAPI) are some of the most common data collection systems. However, none of them guarantees perfect data sets, and a certain risk of error generation is always present. In particular, missing or inconsistent values might appear because of a lack of response or inaccurate answer recording.

Missing values can be estimated using data imputation techniques, so the gaps are filled and a complete data set is obtained. The treatment of non-response errors is a fundamental data-cleaning step in the knowledge discovery process, aimed at improving information quality. Statistical agencies usually have to apply imputation techniques to the data sets resulting from their surveys, which has proved to be a time-consuming task.

Artificial neural networks (henceforth termed ANNs) constitute flexible computing frameworks and universal approximators that can be applied to a wide range of prediction and classification problems with a high degree of accuracy. The application of different ANN approaches to data imputation has been studied previously from different points of view.

Kuligowski and Barros [12] introduced a backpropagation neural network for missing data estimation by using concurrent rainfall data from neighbouring gauges. Refs. [6], [25] dealt with self-organizing maps (SOM) as data imputation tools in different application areas. In other works, such as [22], the use of multiple imputations for the analysis of missing data was considered.

Kalteh and Hjorth [10] imputed missing values with SOM, multilayer perceptron, multivariate nearest neighbours, the regularised expectation maximization algorithm and multiple imputation in the context of a precipitation–runoff process database. Kaya et al. [11] carried out a comparison of the neural networks, the expectation maximization algorithm and the multiple imputation techniques, while the application of genetic algorithms was proposed in [15].

Subasi et al. [23] presented a new imputation method for incomplete binary data, and in [21] a methodology for data imputation by ANNs was proposed and empirically compared with other data mining models, evaluating the performance of the imputation process by applying a variant of the k-nearest neighbours (k-NN) method to the classification task on the imputed databases.

García-Laencina et al. [8] presented a multi-task learning (MTL) based approach using a multilayer perceptron to impute missing values in classification problems. They combined classification and imputation in a single neural architecture, with classification as the main task and imputation as the secondary task.

Rahman and Islam [18] proposed two techniques for the imputation of both categorical and numerical missing values, using decision trees and forests. The missing values were imputed using similarity and correlations, and they merged segments to achieve a higher quality of imputation. Azim and Aggarwal [2] described a two-stage hybrid model to fill missing values using fuzzy c-means clustering and multilayer perceptrons.

Aydilek and Arslan [1] used a hybrid neural network and weighted nearest neighbours to estimate missing values. The estimation system involved an auto-associative model to predict the input data, coupled with the k-nearest neighbours to approximate the missing data.

The present work focuses on a particular missing values pattern, the monotone pattern, where a set of variables is missing on the same set of records. This pattern is appropriate for building imputation models based on the aggregation of a set of predictions. A multiple imputation approach (MIMLP from now onwards) is proposed, whose implementation relies on the combination of the multilayer perceptron (hereinafter MLP) and k-nearest neighbours. It is compared with an estimation model of missing values, also based on a multilayer perceptron (IMLP in what follows) and studied in [21].
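As a toy illustration (not taken from the paper's data sets), a monotone pattern of the kind described above might look as follows: the incomplete records all miss the same trailing block of variables.

```python
import numpy as np

nan = np.nan
# Toy monotone missing-data pattern: the incomplete records (last two rows)
# miss the same set of variables (the last two columns), while the remaining
# records are complete.
X = np.array([
    [1.2, 3.4, 0.7, 2.1],   # complete case
    [0.5, 1.8, 2.2, 0.9],   # complete case
    [2.3, 0.1, nan, nan],   # incomplete case
    [1.1, 4.0, nan, nan],   # incomplete case
])
```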

Tusell's work [24] has been followed but, regarding variable types, the present research extends it to qualitative variables, so that a case is imputed from the complete cases using Gower's distance, further explained below. Additionally, this study also provides insights into the selection of parameter values. To compare the efficiency of both methods, the classical models hot-deck, mean/mode substitution and regression have also been implemented.
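Since Gower's distance handles mixed numerical/categorical records, a minimal sketch of its usual formulation (one minus the averaged per-variable similarities) is given below; the function name and argument layout are illustrative only, not the paper's implementation.

```python
import numpy as np

def gower_distance(a, b, is_numeric, ranges):
    """Gower distance between two mixed-type records.

    a, b       : sequences of equal length (one record each)
    is_numeric : boolean mask, True where the variable is quantitative
    ranges     : per-variable ranges (max - min over the complete cases),
                 assumed non-zero for every quantitative variable
    """
    sims = np.empty(len(a))
    for k in range(len(a)):
        if is_numeric[k]:
            # quantitative variable: similarity = 1 - |difference| / range
            sims[k] = 1.0 - abs(float(a[k]) - float(b[k])) / ranges[k]
        else:
            # categorical variable: similarity = 1 if the categories match, else 0
            sims[k] = 1.0 if a[k] == b[k] else 0.0
    # Gower distance = 1 - mean similarity over the compared variables
    return 1.0 - sims.mean()
```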

This paper is organised as follows. In Section 2, missing data patterns and mechanisms, as well as imputation methods, are described. Section 3 introduces general aspects of artificial neural networks. Section 4 deals with the experiments carried out. The automatic procedures to impute missing values based on ANNs, the IMLP and MIMLP models, are described in Sections 5 and 6, respectively. The comparison with other well-known methods is presented in Section 7, where the neural network configuration is extensively studied on both the MIMLP and IMLP models. The results and conclusions, shown in Sections 7 and 8, reveal a clear improvement in data set quality for this machine learning approach.

Section snippets

Data imputation

In Section 2.1, the different mechanisms that generate missing values are shown, and the missing data patterns are described. In Section 2.2, the three classical single imputation procedures (mean/mode imputation, regression and hot-deck), implemented to compare the efficiency of the proposed methods, are explained.

Artificial neural networks

The outputs $o_j$ of the considered three-layered perceptron are:

$$o_j = w_{0j} + \sum_{h=1}^{H} w_{hj}\, g\!\left(v_{0h} + \sum_{i=1}^{p} v_{ih} x_i\right), \qquad j = 1, 2, \ldots, q$$

for $p$ inputs $x_1, \ldots, x_p$, denoting by $H$ the size of the hidden layer, $\{v_{ih},\ i = 0, 1, \ldots, p,\ h = 1, \ldots, H\}$ the synaptic weights for the connections between the $p$-sized input and the hidden layer, and $\{w_{hj},\ h = 0, 1, \ldots, H,\ j = 1, \ldots, q\}$ the synaptic weights for the connections between the hidden and the $q$-sized output layer.

The hyperbolic tangent activation function $g(u) = (e^{u} - e^{-u})/(e^{u} + e^{-u})$ is used in the hidden
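To make the formula concrete, a minimal NumPy sketch of this forward pass is shown below; the layer sizes and randomly initialised weights are placeholders, not trained values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, H, q = 4, 6, 2                 # inputs, hidden units, outputs (illustrative sizes)

V = rng.normal(size=(H, p + 1))   # v_ih: hidden-layer weights, column 0 holds the biases v_0h
W = rng.normal(size=(q, H + 1))   # w_hj: output-layer weights, column 0 holds the biases w_0j

def forward(x):
    """Compute o_j = w_0j + sum_h w_hj * g(v_0h + sum_i v_ih * x_i)."""
    a = V[:, 0] + V[:, 1:] @ x    # hidden-layer pre-activations
    g = np.tanh(a)                # hyperbolic tangent activation g(u)
    return W[:, 0] + W[:, 1:] @ g # linear output layer of size q

print(forward(rng.normal(size=p)))
```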

Empirical experiments

Section 4.1 provides certain aspects of the extensive suite of experiments carried out. Section 4.2 describes the data sets used to study the models. In Section 4.3, preprocessing tasks and perturbations applied to the original data sets are explained. Finally, Section 4.4 contains details about the computed measures to evaluate the different data imputation models.

IMLP model

Fitting the IMLP requires several decisions, such as the random initialisation of the MLP weights, the number of hidden units, the number of iterations (epochs) of the learning algorithm and, of course, the training algorithm itself. The parameter values can affect the imputation quality; thus, different architectures were studied to obtain the best configuration, or at least a suitable range of values.

It is a well-known fact that the random initial configuration of weights offers very
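As a rough sketch of the single-imputation idea (using scikit-learn's MLPRegressor as a stand-in, not the authors' exact IMLP, and with arbitrary hidden-layer size, epoch count and seed), one numeric column could be imputed as follows; under a monotone pattern, the predictor columns are assumed complete.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def impute_with_mlp(X, target_col, hidden_units=10, epochs=500, seed=0):
    """Impute NaNs in one numeric column by training an MLP on the complete rows.

    X          : 2-D float array with NaNs only in column `target_col`
    target_col : index of the column to impute
    """
    missing = np.isnan(X[:, target_col])
    predictors = np.delete(X, target_col, axis=1)

    mlp = MLPRegressor(hidden_layer_sizes=(hidden_units,),
                       activation='tanh',
                       max_iter=epochs,
                       random_state=seed)
    # train on the complete cases only
    mlp.fit(predictors[~missing], X[~missing, target_col])

    # fill the gaps with the network's predictions
    X_imputed = X.copy()
    X_imputed[missing, target_col] = mlp.predict(predictors[missing])
    return X_imputed
```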

MIMLP model

The multiple imputation method consists in obtaining a vector of MI > 1 imputed values for each missing value. These imputed values are alternative candidates for filling in the missing entry. From the imputation vectors, MI complete data sets are generated. Each missing value is replaced by the first element of its imputation vector, obtaining the first complete data set. Then, each missing value is replaced by the second element of its imputation vector, obtaining the second complete data
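One plausible reading of the MLP/k-NN combination (a sketch only, not necessarily the exact MIMLP procedure) is to match the MLP prediction for an incomplete record against the complete cases and let the MI nearest donors supply the imputation vector:

```python
import numpy as np

def multiple_impute(pred, donor_preds, donor_values, MI=5):
    """Return MI imputed candidates for one missing value.

    pred         : MLP prediction of the missing value for the incomplete record
    donor_preds  : MLP predictions of the same variable for the complete cases
    donor_values : observed values of that variable in the complete cases
    MI           : number of imputations (MI > 1)
    """
    # distance between the prediction and each complete case's prediction
    dist = np.abs(donor_preds - pred)
    nearest = np.argsort(dist)[:MI]     # indices of the MI closest donors
    return donor_values[nearest]        # observed donor values = imputation vector

# Each missing value then receives a vector of MI candidates; taking the m-th
# element of every vector yields the m-th completed data set (m = 1, ..., MI).
```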

Experimental results and discussion

This section shows the comparison of the different methods for missing data imputation on the different data sets: the classical imputation procedures (mean/mode, regression and hot-deck) and both ANN-based models, IMLP and MIMLP.

To complete the study, the evaluation of all the models was carried out from two different perspectives. Firstly, the performance of the imputation methods was assessed according to the aforementioned evaluation criteria. Secondly, the performance of several

Conclusions

Two models for data imputation, IMLP and MIMLP, were proposed and empirically compared with three classical methods: mean/mode imputation, regression models and hot-deck. In the IMLP model, based on artificial neural networks, several architectures and training algorithms for the multilayer perceptron were tested. In the MIMLP model, a multiple imputation technique combining multilayer perceptron and k-nearest neighbours, a wide range of parameter configurations (λ and kmax) was explored. This

References (26)

  • A. Frank et al., UCI Machine Learning Repository (2010).

  • J. Gower, A general coefficient of similarity and some of its properties, Biometrics (1971).

  • A. Kalteh et al., Imputation of missing values in a precipitation–runoff process database, Hydrol. Res. (2009).