Neurocomputing, Volume 248, 26 July 2017, Pages 11-18

Euclidean distance estimation in incomplete datasets

https://doi.org/10.1016/j.neucom.2016.12.081

Abstract

This paper proposes a method to estimate the expected value of the Euclidean distance between two possibly incomplete feature vectors. Under the Missing at Random assumption, we show that the Euclidean distance can be modeled by a Nakagami distribution, whose parameters we express as functions of the moments of the unknown data distribution. In our formulation, the data distribution is modeled using a mixture of Gaussians. The proposed method, named Expected Euclidean Distance (EED), is validated through a series of experiments using synthetic and real-world data. Additionally, we show the application of EED to the Minimal Learning Machine (MLM), a distance-based supervised learning method. Experimental results show that EED outperforms existing methods that estimate Euclidean distances in an indirect manner. We also observe that the application of EED to the MLM provides promising results.

Introduction

Data completeness is a major assumption of most machine learning methods. In real-world problems, however, many data instances may suffer from unobserved/missing attributes. This issue, referred to as the missing/incomplete data problem, may arise for a variety of reasons, such as sensor problems, device malfunction, and operator mistakes [1]. The simplest way to deal with missing data consists of removing the instances with missing attributes from the dataset (listwise deletion). Even though this approach may work in some cases, discarding data samples usually leads to the loss of information that is important for building a learning model [2]. Another widely used approach is to perform a pre-processing step of missing data imputation. After filling the missing entries, any conventional learning method can be used. Examples of such an approach can be found in [3], [4], [5] and [6].

According to Acuña and Rodriguez [7], problems with more than 5% of missing samples may require sophisticated handling methods. In such situations, good results can be achieved by not treating imputation as a separate step. Rather, it is possible to design a learning method that handles incomplete data in its formulation. By doing so, the inherent uncertainty of the imputation process is taken into account, which has been shown to be beneficial in many cases [8].

Computing the Euclidean distance is a key step in many machine learning methods, such as k-nearest neighbors [9], k-means [10] and Learning Vector Quantization [11]. In this paper, we propose a strategy to estimate pairwise Euclidean distances between vectors with missing values. The method, named Expected Euclidean Distance (EED) estimation, aims to directly determine the expected value of the Euclidean distance between two potentially incomplete vectors under the assumption that the distances are Nakagami-distributed [12]. The expected value of the Euclidean distance then has a closed-form expression that depends only on the two parameters of the Nakagami distribution. We show that these parameters can be expressed in terms of the non-central moments of the unknown data distribution. To model the data distribution, we adopt a Gaussian mixture distribution whose parameters are estimated using the maximum likelihood method [13].
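
For reference, the closed-form expression mentioned above is the standard mean of a Nakagami-distributed random variable (a known property of the distribution, not a result derived in this paper): if the distance $\eta$ follows a Nakagami distribution with shape parameter $m$ and spread parameter $\Omega$, then
$$\mathbb{E}[\eta] = \frac{\Gamma(m + 1/2)}{\Gamma(m)} \left(\frac{\Omega}{m}\right)^{1/2},$$
so estimating the two parameters from the data moments immediately yields the expected distance.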

To assess the effectiveness of the Expected Euclidean Distance estimation method in the context of supervised learning, we describe the application of EED to the recently proposed Minimal Learning Machine (MLM, [14]). The MLM is a distance-based supervised learning algorithm built on the idea that a mapping exists between the geometric configurations of points in the input and output spaces. These geometric configurations are expressed in terms of distance matrices between data samples. The proposed variant of the MLM for missing data uses EED to build the input distance matrix. Both EED and its application to the MLM are benchmarked using artificial and real-world datasets. Based on the experiments, we show that EED (i) represents a promising alternative for computing distances in cases where missing data is an issue; and (ii) can be successfully applied to distance-based learning methods.

The remainder of this paper is structured as follows. Section 2 explores related works on missing data and distance estimation. Section 3 states the problem of estimating Euclidean distances on datasets with missing values and introduces the proposed solution. In Section 4, we propose an MLM variant for missing data. Experiments on synthetic and real-world datasets are presented in Section 5. Conclusions and future work are given in Section 6.

Section snippets

Related work

The problem of missing data can be classified according to the missingness mechanism. Little and Rubin [15] characterize the mechanisms as Missing Completely at Random (MCAR), Missing at Random (MAR) and Missing Not at Random (MNAR). In MCAR, the missingness of a value is independent of its value and of any other component of the dataset. A less restrictive assumption is MAR, which is more likely to hold in practice [16] and states that the missingness of a component is independent of the value of that component itself, given the observed components.
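
To make the distinction concrete, the toy sketch below (our own illustration, not from the paper) generates MCAR and MAR missingness patterns for a two-feature dataset: under MCAR the mask ignores all values, while under MAR the probability of dropping feature 1 depends only on the observed feature 0.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))

# MCAR: each entry of feature 1 is missing with a fixed probability,
# independently of any value in the dataset.
mcar_mask = rng.random(len(X)) < 0.3

# MAR: feature 1 is more likely to be missing when the *observed*
# feature 0 is large; missingness never depends on feature 1 itself.
p = 1.0 / (1.0 + np.exp(-X[:, 0]))      # probability driven by feature 0
mar_mask = rng.random(len(X)) < p

X_mcar, X_mar = X.copy(), X.copy()
X_mcar[mcar_mask, 1] = np.nan
X_mar[mar_mask, 1] = np.nan
```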

The expected Euclidean distance

Consider two $D$-dimensional vectors $X_i = (x_{i,1}, \ldots, x_{i,D})^T$ and $X_j = (x_{j,1}, \ldots, x_{j,D})^T$. The Euclidean distance $\eta$ between $X_i$ and $X_j$ is given by
$$\eta = z^{1/2} = \sqrt{\sum_{d=1}^{D} (x_{i,d} - x_{j,d})^2},$$
where $z = \|X_i - X_j\|_2^2 = \sum_{d=1}^{D} (x_{i,d} - x_{j,d})^2$ is the squared distance between $X_i$ and $X_j$.

We are interested in estimating $\eta$ when either of the vectors $X_i, X_j \in \mathcal{X}$ has one or more missing components. Specifically, we consider the case in which the unobserved values in $X_i$ and $X_j$ are MAR [15]. Furthermore, we assume $X_i$ and $X_j$ are independent.
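
To make the moment-matching step concrete, the sketch below assumes the mean and variance of the squared distance $z$ have already been computed (in the paper these follow from the non-central moments of the Gaussian mixture fitted to the data; here they are simply passed in). The Nakagami parameters are obtained by matching the first two moments of $z$, and the expected distance follows from the Nakagami mean formula quoted in the Introduction. Function and variable names are illustrative, not taken from the paper.

```python
import math

def expected_euclidean_distance(z_mean, z_var):
    """Moment-matched estimate of E[eta], where eta = sqrt(z).

    Assumes eta follows a Nakagami distribution and matches its
    spread/shape parameters to the given mean and variance of the
    squared distance z (a sketch of the EED idea, not the paper's code).
    """
    omega = z_mean                      # spread: E[z]
    m = z_mean ** 2 / z_var             # shape: E[z]^2 / Var(z)
    # Mean of a Nakagami(m, omega) variable:
    #   E[eta] = Gamma(m + 1/2) / Gamma(m) * sqrt(omega / m)
    log_ratio = math.lgamma(m + 0.5) - math.lgamma(m)
    return math.exp(log_ratio) * math.sqrt(omega / m)

# Example: if z has mean 4.0 and variance 2.0, the estimated distance
# is slightly below sqrt(4.0) = 2.0, as expected from Jensen's inequality.
print(expected_euclidean_distance(4.0, 2.0))
```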

Application to the Minimal Learning Machine

The Minimal Learning Machine is a recently proposed distance-based supervised learning algorithm. The basic idea behind the MLM is the existence of a linear mapping between distances taken from the input and output spaces [14]. The MLM can be divided into two steps: distance regression (training) and output estimation (test).

Consider a training set $\mathcal{D} = \{X, Y\}$ with $N$ examples, such that $X = \{X_i\}_{i=1}^{N}$ and $Y = \{Y_i\}_{i=1}^{N}$ are, respectively, a set of $D$-dimensional input points and their $S$-dimensional outputs. The training step builds distance matrices between the data and a set of reference points in the input and output spaces and fits a linear regression between them [14].
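
A rough sketch of these two steps is given below, under simplifying assumptions: all training points are used as reference points, and the output is estimated by numerically minimizing the multilateration cost. The names `fit_mlm` and `predict_mlm` are ours; in the missing-data variant, the entries of the input distance matrix would be computed with EED instead of `cdist`.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import minimize

def fit_mlm(X, Y):
    """Distance regression step: learn a linear map between the
    input and output distance matrices (all points used as references).
    X has shape (N, D) and Y has shape (N, S)."""
    Dx = cdist(X, X)                    # N x N input distance matrix
    Dy = cdist(Y, Y)                    # N x N output distance matrix
    # Least-squares solution of Dx @ B ≈ Dy
    B, *_ = np.linalg.lstsq(Dx, Dy, rcond=None)
    return B

def predict_mlm(x_new, X, Y, B):
    """Output estimation step: predict distances to the reference
    outputs, then locate the output by multilateration."""
    d_in = cdist(x_new[None, :], X)             # 1 x N input distances
    d_out_hat = (d_in @ B).ravel()              # predicted output distances
    # Find y minimizing the mismatch between ||y - Y_k||^2 and the
    # predicted squared distances (multilateration objective).
    def objective(y):
        diff = np.sum((Y - y) ** 2, axis=1) - d_out_hat ** 2
        return np.sum(diff ** 2)
    y0 = Y.mean(axis=0)
    return minimize(objective, y0).x
```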

Experiments and results

To assess the performance of the proposed method, we conducted four experiments with both synthetic and real-world data. The first three experiments evaluate how well the EED method reconstructs Euclidean distances. In the fourth, we evaluate the use of EED in the design of the MLM for incomplete data.

The EED method is compared to Conditional Mean Imputation (CMI, [24]) and the Expected Squared Distance (ESD, [1]). In CMI, the missing entries are filled with their expected values conditioned on the observed components.
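
For intuition on the CMI baseline, the sketch below fills the missing entries of a vector with their conditional mean under a single multivariate Gaussian (the standard Gaussian conditioning formula). The paper's baseline may rely on a richer data model, so this is a simplified, assumed setup with illustrative names.

```python
import numpy as np

def conditional_mean_impute(x, mean, cov):
    """Fill NaN entries of x with E[x_missing | x_observed] under a
    multivariate Gaussian with the given mean and covariance
    (simplified single-Gaussian version of conditional mean imputation)."""
    x = np.asarray(x, dtype=float).copy()
    miss = np.isnan(x)
    if not miss.any():
        return x
    obs = ~miss
    # Gaussian conditioning: mu_m + S_mo @ S_oo^{-1} @ (x_o - mu_o)
    S_oo = cov[np.ix_(obs, obs)]
    S_mo = cov[np.ix_(miss, obs)]
    x[miss] = mean[miss] + S_mo @ np.linalg.solve(S_oo, x[obs] - mean[obs])
    return x
```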

Conclusion

In this paper, we presented an algorithm to estimate the pairwise Euclidean distance between vectors with missing data. In the proposed method, named Expected Euclidean Distance (EED), the distance between vectors is assumed to follow a Nakagami distribution whose parameters can be calculated using the first four moments of the dataset distribution. In EED, a GMM is used to model the distribution of the dataset.

The accuracy of EED in distance estimation is verified through experiments on synthetic and real-world datasets.


References (26)

  • J. MacQueen, Some methods for classification and analysis of multivariate observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics (1967)

  • T. Kohonen, Self-organization and Associative Memory, 3rd Edition (1989)

  • M. Nakagami, The m-distribution – A general formula of intensity distribution of rapid fading


Diego Parente P. Mesquita holds a bachelor’s degree in Computer Science from Universidade Federal do Ceará (UFC, 2015) and is currently pursuing a master’s degree in Computer Science at the same university.

João Paulo Pordeus Gomes holds a bachelor’s degree in Electrical Engineering from Universidade Federal do Ceará (UFC, 2004), Brazil, a master’s degree (2006) in Aeronautical Engineering and a doctorate (2011) in Electronic Engineering from Instituto Tecnológico de Aeronáutica (ITA), São José dos Campos, SP, Brazil. Dr. Gomes worked for EMBRAER S.A. between 2006 and 2013 as a Technology Development Engineer focusing on fault monitoring applications in aeronautical systems. He is currently an Assistant Professor at UFC.

Amauri Holanda de Souza Junior holds a bachelor’s degree in Telematics from the Federal Institute of Ceará (IFCE, 2007), Brazil, and master’s (2006) and doctorate (2014) degrees in Teleinformatics Engineering from Universidade Federal do Ceará (UFC), Fortaleza, CE, Brazil. He is currently an Assistant Professor at IFCE.

Juvêncio Santos Nobre holds a bachelor’s degree in Statistics from Universidade Federal do Ceará (UFC, 2002), Brazil, and master’s (2004) and doctorate (2017) degrees in Statistics from Universidade de São Paulo (USP), São Paulo, SP, Brazil. He is currently an Assistant Professor at UFC.
