
Applied Soft Computing

Volume 29, April 2015, Pages 65-74

Single imputation with multilayer perceptron and multiple imputation combining multilayer perceptron and k-nearest neighbours for monotone patterns

https://doi.org/10.1016/j.asoc.2014.09.052

Highlights

  • Imputation of data for monotone patterns of missing values.

  • An estimation model of missing data based on multilayer perceptron.

  • Multiple imputation combining a neural network and k-nearest neighbours.

  • Comparison of the performance of proposed models with three classic procedures.

  • Three classic single imputation models: mean/mode, regression and hot-deck.

Abstract

The knowledge discovery process relies on the information gathered from collected data sets, which often contain errors in the form of missing values. Data imputation is the activity aimed at estimating values for missing data items. This study focuses on the development of automated data imputation models based on artificial neural networks for monotone patterns of missing values. The present work proposes a single imputation approach relying on a multilayer perceptron whose training is conducted with different learning rules, and a multiple imputation approach based on the combination of the multilayer perceptron and k-nearest neighbours. Eighteen real and simulated databases were exposed to a perturbation experiment with random generation of monotone missing data patterns. An empirical test was carried out on these data sets with both approaches (single and multiple imputation), and three classical single imputation procedures – mean/mode imputation, regression and hot-deck – were also considered, so the experiments involved five imputation methods in total. The results, considering different performance measures, demonstrate that, in comparison with traditional tools, both proposals improve the automation level and data quality, offering a satisfactory performance.

Introduction

Computer assisted personal interview (CAPI), computer assisted telephone interview (CATI) and web assisted personal interview (WAPI) are some of the most common data collection systems. However, none of them guarantees perfect data sets, and a certain risk of error generation is always present. In particular, missing or inconsistent values might appear because of a lack of response or inaccurate answer recording.

Missing values can be estimated using data imputation techniques, so the gaps are filled and a complete data set is obtained. The treatment of non-response errors is a fundamental data-cleaning step in the knowledge discovery process, aimed at improving information quality. Statistical agencies usually have to apply imputation techniques to the data sets resulting from their surveys, which has proved to be a time-consuming task.

Artificial neural networks (henceforth termed ANNs) constitute flexible computing frameworks and universal approximators that can be applied to a wide range of prediction and classification problems with a high degree of accuracy. The application of different ANN approaches to data imputation has been studied previously from different points of view.

Kuligowski and Barros [12] introduced a backpropagation neural network for missing data estimation by using concurrent rainfall data from neighbouring gauges. Refs. [6], [25] dealt with self-organizing maps (SOM) as data imputation tools in different application areas. In other works, such as [22], the use of multiple imputations for the analysis of missing data was considered.

Kalteh and Hjorth [10] imputed missing values with SOM, multilayer perceptron, multivariate nearest neighbours, the regularised expectation maximization algorithm and multiple imputation in the context of a precipitation–runoff process database. Kaya et al. [11] carried out a comparison of the neural networks, the expectation maximization algorithm and the multiple imputation techniques, while the application of genetic algorithms was proposed in [15].

Subasi et al. [23] presented a new imputation method for incomplete binary data, and in [21] a methodology for data imputation by ANNs was proposed and empirically compared with other data mining models, evaluating the performance of the imputation process by applying a variant of the k-nearest neighbours (k-NN) method to the classification task on the imputed databases.

García-Laencina et al. [8] presented a multi-task learning (MTL) based approach using a multilayer perceptron to impute missing values in classification problems. They combined classification and imputation in a single neural architecture, with classification as the main task and imputation as the secondary task.

Rahman and Islam [18] proposed two techniques for the imputation of both categorical and numerical missing values, using decision trees and forests. The missing values were imputed using similarity and correlations, and they merged segments to achieve a higher quality of imputation. Azim and Aggarwal [2] described a two-stage hybrid model to fill missing values using fuzzy c-means clustering and multilayer perceptrons.

Aydilek and Arslan [1] used a hybrid neural network and weighted nearest neighbours to estimate missing values. The estimation system involved an auto-associative model to predict the input data, coupled with the k-nearest neighbours to approximate the missing data.

The present work focuses on a particular missing values pattern, the monotone pattern, where a set of variables is missing on the same set of records. This pattern is appropriate for building imputation models based on the aggregation of a set of predictions. A multiple imputation approach (MIMLP from now onwards) is proposed, whose implementation relies on the combination of the multilayer perceptron (hereinafter MLP) and k-nearest neighbours. It is compared with an estimation model of missing values, also based on a multilayer perceptron (IMLP in what follows) and studied in [21].
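As a toy illustration (not taken from the paper's data sets), a monotone pattern of the kind described above might look as follows: the incomplete records all miss the same trailing block of variables.

```python
import numpy as np

nan = np.nan
# Toy monotone missing-data pattern: the incomplete records (last two rows)
# miss the same set of variables (the last two columns), while the remaining
# records are complete.
X = np.array([
    [1.2, 3.4, 0.7, 2.1],   # complete case
    [0.5, 1.8, 2.2, 0.9],   # complete case
    [2.3, 0.1, nan, nan],   # incomplete case
    [1.1, 4.0, nan, nan],   # incomplete case
])
```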

Tusell's work [24] has been followed but, regarding variable types, the present research extends it to qualitative variables, so that a case is imputed from the complete cases using Gower's distance, further explained below. Additionally, this study also provides insights into the selection of parameter values. To compare the efficiency of both methods, the classical models hot-deck, mean/mode substitution and regression have also been implemented.
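Since Gower's distance handles mixed numerical/categorical records, a minimal sketch of its usual formulation (one minus the averaged per-variable similarities) is given below; the function name and argument layout are illustrative only, not the paper's implementation.

```python
import numpy as np

def gower_distance(a, b, is_numeric, ranges):
    """Gower distance between two mixed-type records.

    a, b       : sequences of equal length (one record each)
    is_numeric : boolean mask, True where the variable is quantitative
    ranges     : per-variable ranges (max - min over the complete cases),
                 assumed non-zero for every quantitative variable
    """
    sims = np.empty(len(a))
    for k in range(len(a)):
        if is_numeric[k]:
            # quantitative variable: similarity = 1 - |difference| / range
            sims[k] = 1.0 - abs(float(a[k]) - float(b[k])) / ranges[k]
        else:
            # categorical variable: similarity = 1 if the categories match, else 0
            sims[k] = 1.0 if a[k] == b[k] else 0.0
    # Gower distance = 1 - mean similarity over the compared variables
    return 1.0 - sims.mean()
```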

This paper is organised as follows. In Section 2, missing data patterns and mechanisms, as well as imputation methods, are described. Section 3 introduces general aspects of artificial neural networks. Section 4 deals with the experiments carried out. The automatic procedures to impute missing values based on ANNs, the IMLP and MIMLP models, are described in Sections 5 and 6, respectively. The comparison with other well-known methods is presented in Section 7, where the neural network configuration is extensively studied on both the MIMLP and IMLP models. The results and conclusions, shown in Sections 7 and 8, reveal a clear improvement in data set quality for this machine learning approach.

Section snippets

Data imputation

In Section 2.1, the different mechanisms that generate missing values are shown, and the missing data patterns are described. In Section 2.2, the three classical single imputation procedures (mean/mode imputation, regression and hot-deck), implemented to compare the efficiency of the proposed methods, are explained.

Artificial neural networks

The outputs $o_j$ of the considered three-layered perceptron are:

$$o_j = w_{0j} + \sum_{h=1}^{H} w_{hj}\, g\!\left(v_{0h} + \sum_{i=1}^{p} v_{ih} x_i\right), \qquad j = 1, 2, \ldots, q$$

for $p$ inputs $x_1, \ldots, x_p$, denoting by $H$ the size of the hidden layer, $\{v_{ih},\ i = 0, 1, \ldots, p,\ h = 1, \ldots, H\}$ the synaptic weights for the connections between the $p$-sized input and the hidden layer, and $\{w_{hj},\ h = 0, 1, \ldots, H,\ j = 1, \ldots, q\}$ the synaptic weights for the connections between the hidden and the $q$-sized output layer.

The hyperbolic tangent activation function $g(u) = (e^{u} - e^{-u})/(e^{u} + e^{-u})$ is used in the hidden
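To make the formula concrete, a minimal NumPy sketch of this forward pass is shown below; the layer sizes and randomly initialised weights are placeholders, not trained values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, H, q = 4, 6, 2                 # inputs, hidden units, outputs (illustrative sizes)

V = rng.normal(size=(H, p + 1))   # v_ih: hidden-layer weights, column 0 holds the biases v_0h
W = rng.normal(size=(q, H + 1))   # w_hj: output-layer weights, column 0 holds the biases w_0j

def forward(x):
    """Compute o_j = w_0j + sum_h w_hj * g(v_0h + sum_i v_ih * x_i)."""
    a = V[:, 0] + V[:, 1:] @ x    # hidden-layer pre-activations
    g = np.tanh(a)                # hyperbolic tangent activation g(u)
    return W[:, 0] + W[:, 1:] @ g # linear output layer of size q

print(forward(rng.normal(size=p)))
```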

Empirical experiments

Section 4.1 provides certain aspects of the extensive suite of experiments carried out. Section 4.2 describes the data sets used to study the models. In Section 4.3, preprocessing tasks and perturbations applied to the original data sets are explained. Finally, Section 4.4 contains details about the computed measures to evaluate the different data imputation models.

IMLP model

Fitting the IMLP requires several decisions, such as the random initialisation of the MLP weights, the number of hidden units, the number of iterations (epochs) of the learning algorithm and, of course, the training algorithm itself. The parameter values can affect the imputation quality; thus, different architectures were studied to obtain the best configuration, or at least a suitable range of values.

It is a well-known fact that the random initial configuration of weights offers very
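As a rough sketch of the single-imputation idea (using scikit-learn's MLPRegressor as a stand-in, not the authors' exact IMLP, and with arbitrary hidden-layer size, epoch count and seed), one numeric column could be imputed as follows; under a monotone pattern, the predictor columns are assumed complete.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def impute_with_mlp(X, target_col, hidden_units=10, epochs=500, seed=0):
    """Impute NaNs in one numeric column by training an MLP on the complete rows.

    X          : 2-D float array with NaNs only in column `target_col`
    target_col : index of the column to impute
    """
    missing = np.isnan(X[:, target_col])
    predictors = np.delete(X, target_col, axis=1)

    mlp = MLPRegressor(hidden_layer_sizes=(hidden_units,),
                       activation='tanh',
                       max_iter=epochs,
                       random_state=seed)
    # train on the complete cases only
    mlp.fit(predictors[~missing], X[~missing, target_col])

    # fill the gaps with the network's predictions
    X_imputed = X.copy()
    X_imputed[missing, target_col] = mlp.predict(predictors[missing])
    return X_imputed
```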

MIMLP model

The multiple imputation method consists in obtaining a vector of MI > 1 imputed values for each missing value. These imputed values are alternative candidates for filling in the missing entry. From the imputation vectors, MI complete data sets are generated. Each missing value is replaced by the first element of its imputation vector, obtaining the first complete data set. Then, each missing value is replaced by the second element of its imputation vector, obtaining the second complete data
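One plausible reading of the MLP/k-NN combination (a sketch only, not necessarily the exact MIMLP procedure) is to match the MLP prediction for an incomplete record against the complete cases and let the MI nearest donors supply the imputation vector:

```python
import numpy as np

def multiple_impute(pred, donor_preds, donor_values, MI=5):
    """Return MI imputed candidates for one missing value.

    pred         : MLP prediction of the missing value for the incomplete record
    donor_preds  : MLP predictions of the same variable for the complete cases
    donor_values : observed values of that variable in the complete cases
    MI           : number of imputations (MI > 1)
    """
    # distance between the prediction and each complete case's prediction
    dist = np.abs(donor_preds - pred)
    nearest = np.argsort(dist)[:MI]     # indices of the MI closest donors
    return donor_values[nearest]        # observed donor values = imputation vector

# Each missing value then receives a vector of MI candidates; taking the m-th
# element of every vector yields the m-th completed data set (m = 1, ..., MI).
```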

Experimental results and discussion

This section shows the comparison of the different methods for missing data imputation on the different data sets: the classical imputation procedures (mean/mode, regression and hot-deck) and both ANN-based models, IMLP and MIMLP.

To complete the study, the evaluation of all the models was carried out from two different perspectives. Firstly, the performance of the imputation methods was assessed according to the aforementioned evaluation criteria. Secondly, the performance of several

Conclusions

Two models for data imputation, IMLP and MIMLP, were proposed and empirically compared with three classical methods: mean/mode imputation, regression models and hot-deck. In the IMLP model, based on artificial neural networks, several architectures and training algorithms for the multilayer perceptron were tested. In the MIMLP model, a multiple imputation technique combining multilayer perceptron and k-nearest neighbours, a wide range of parameter configurations (λ and kmax) was explored. This

References (26)

  • A. Frank et al., UCI Machine Learning Repository (2010).

  • J. Gower, A general coefficient of similarity and some of its properties, Biometrics (1971).

  • A. Kalteh et al., Imputation of missing values in a precipitation–runoff process database, Hydrol. Res. (2009).