Euclidean distance estimation in incomplete datasets
Introduction
Data completeness is a major assumption of most machine learning methods. In real-world problems, however, several data instances may suffer from unobserved/missing attributes. This issue, referred to as the missing/incomplete data problem, may arise for a variety of reasons, such as sensor problems, device malfunctions, and operator mistakes [1]. The simplest way to deal with missing data consists of removing the instances with missing attributes (listwise deletion) from the dataset. Even though this approach may work in some cases, discarding data samples usually leads to the loss of information that is important for building a learning model [2]. Another widely used approach is to perform a pre-processing step of missing data imputation. After filling the missing entries, any conventional learning method can be used. Examples of such an approach can be found in [3], [4], [5] and [6].
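The two baseline strategies mentioned above can be illustrated with a small sketch (toy data and mean imputation chosen purely for illustration; the paper's CMI baseline uses a conditional, not marginal, mean):

```python
import numpy as np

# Toy data matrix with NaN marking missing entries (illustrative values only).
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan],
              [5.0, 6.0]])

# Listwise deletion: keep only the rows with no missing attribute.
complete_rows = ~np.isnan(X).any(axis=1)
X_listwise = X[complete_rows]          # only 2 of the 4 samples survive

# Unconditional mean imputation: fill each missing entry with its column mean.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)
```

Listwise deletion discards half of this toy dataset, which is exactly the information loss the text describes; imputation keeps every sample but injects values the data never contained.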
According to Acuña and Rodriguez in [7], problems with more than 5% missing data may require sophisticated handling methods. In such situations, good results can be achieved by not treating imputation as a separate step. Rather, it is possible to design a learning method that handles incomplete data in its own formulation. By doing so, the inherent uncertainty of the imputation process is taken into account, which has been shown to be beneficial in many cases [8].
Computing the Euclidean distance is a key step in many machine learning methods, such as k-nearest neighbors [9], k-means [10] and Learning Vector Quantization [11]. In this paper, we propose a strategy to estimate the pairwise Euclidean distance between vectors with missing values. The method, named Expected Euclidean Distance (EED) estimation, aims to directly determine the expected value of the Euclidean distance between two potentially incomplete vectors under the assumption that the distances are Nakagami-distributed [12]. The expected value of the Euclidean distance has a closed-form solution that depends only on the two parameters of the Nakagami distribution. We show that these parameters can be expressed in terms of the non-central moments of the unknown data distribution. To model the data distribution, we adopt a Gaussian mixture distribution whose parameters are estimated using the maximum likelihood method [13].
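The closed form referred to above follows from a standard property of the Nakagami distribution: a Nakagami(m, Ω) variable has mean Γ(m + 1/2)/Γ(m) · √(Ω/m). A minimal sketch of that formula (the function name is ours; estimating m and Ω from the data moments is the part the paper itself develops):

```python
from math import gamma, sqrt

def nakagami_mean(m, omega):
    """Closed-form mean of a Nakagami(m, omega) random variable:
    E[eta] = Gamma(m + 1/2) / Gamma(m) * sqrt(omega / m)."""
    return gamma(m + 0.5) / gamma(m) * sqrt(omega / m)

# Sanity check: for m = 1 the Nakagami reduces to a Rayleigh distribution,
# whose mean is (sqrt(pi) / 2) * sqrt(omega).
```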
To assess the effectiveness of the Expected Euclidean Distance estimation method in the context of supervised learning, we describe the application of the EED in the recently proposed Minimal Learning Machine (MLM, [14]). The MLM is a distance-based supervised learning algorithm built on the idea that a mapping exists between the geometric configurations of points in the input and output spaces. These geometric configurations are expressed in terms of distance matrices between data samples. The proposed variant of the MLM for missing data uses the EED to build the input distance matrix. Both the EED and its application to the MLM are benchmarked using artificial and real-world datasets. Based on the experiments, we show that the EED (i) represents a promising alternative for computing distances in cases where missing data is an issue; and (ii) can be successfully applied to distance-based learning methods.
The remainder of this paper is structured as follows. Section 2 explores related works on missing data and distance estimation. Section 3 states the problem of estimating Euclidean distances on datasets with missing values and introduces the proposed solution. In Section 4, we propose a MLM variant for missing data. Experiments on synthetic and real-world datasets are presented in Section 5. Conclusions and future work are given in Section 6.
Related work
The problem of missing data can be classified according to the missingness mechanism. Little and Rubin in [15] characterize the mechanisms as Missing Completely at Random (MCAR), Missing at Random (MAR) and Missing Not at Random (MNAR). In MCAR, the missingness of a value is independent of its value and of any other component of the dataset. A less restrictive assumption is MAR, which is more likely to occur in practice [16] and states that the missingness of a component is independent of the missing values themselves, given the observed data.
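The MCAR mechanism is the easiest to simulate, which is how missing data experiments (including those in Section 5) are commonly set up: each entry is removed with a fixed probability, independently of the data values. A hypothetical sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def apply_mcar(X, p):
    """Remove each entry independently with probability p (MCAR):
    the missingness mask does not depend on the data values at all."""
    X = np.asarray(X, float).copy()
    mask = rng.random(X.shape) < p
    X[mask] = np.nan
    return X
```

Simulating MAR or MNAR is harder precisely because the mask must then depend on observed (MAR) or unobserved (MNAR) values.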
The expected euclidean distance
Consider two D-dimensional vectors $X_i, X_j \in \mathbb{R}^D$. The Euclidean distance $\eta$ between $X_i$ and $X_j$ is given by $\eta = \sqrt{S}$, where $S = \sum_{d=1}^{D} (X_{id} - X_{jd})^2$ is the squared distance between $X_i$ and $X_j$.
We are interested in estimating $\eta$ when either vector has one or more missing components. Specifically, we consider the case in which the unobserved values in $X_i$ and $X_j$ are MAR [15]. Furthermore, we assume that $X_i$ and $X_j$ are independent.
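A simpler, related quantity helps fix ideas before the full EED derivation: for independent vectors, the expected squared distance decomposes component-wise into squared mean differences plus variances (this is essentially what the ESD baseline of Section 5 computes). A sketch under that assumption, with our own function name:

```python
import numpy as np

def expected_squared_distance(mu_i, var_i, mu_j, var_j):
    """Expected squared Euclidean distance between two independent random
    vectors, given per-component means and variances:
        E[S] = sum_d (mu_i[d] - mu_j[d])**2 + var_i[d] + var_j[d].
    An observed component has zero variance and mean equal to the observed
    value; a missing component carries the data model's moments."""
    mu_i, var_i = np.asarray(mu_i, float), np.asarray(var_i, float)
    mu_j, var_j = np.asarray(mu_j, float), np.asarray(var_j, float)
    return float(np.sum((mu_i - mu_j) ** 2 + var_i + var_j))
```

Note that the EED goes further: it estimates $E[\eta] = E[\sqrt{S}]$, which is not $\sqrt{E[S]}$, and this is where the Nakagami assumption enters.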
Application to the Minimal Learning Machine
Minimal Learning Machine is a recently proposed distance-based supervised learning algorithm. The basic idea behind the MLM is the existence of a linear mapping between distances taken from the input and output spaces [14]. The MLM can be divided into two steps: distance regression (training) and output estimation (test).
Consider a training set with N examples, such that $X = \{x_i\}_{i=1}^{N} \subset \mathbb{R}^D$ and $Y = \{y_i\}_{i=1}^{N} \subset \mathbb{R}^S$ are, respectively, a set of D-dimensional input points and their S-dimensional outputs.
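The MLM training step described in [14] amounts to a multiresponse least-squares regression from input distances to output distances. A minimal sketch, assuming for simplicity that all N training points serve as reference points (the MLM normally selects a subset):

```python
import numpy as np

def pairwise_distances(A):
    """N x N matrix of Euclidean distances between the rows of A."""
    return np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)

def mlm_train(X, Y):
    """Sketch of the MLM distance-regression (training) step: fit a linear
    map B between the input distance matrix D and the output distance
    matrix Delta by least squares, so that D @ B approximates Delta."""
    D = pairwise_distances(np.asarray(X, float))
    Delta = pairwise_distances(np.asarray(Y, float))
    B, *_ = np.linalg.lstsq(D, Delta, rcond=None)
    return B
```

In the missing-data variant proposed here, the matrix D would instead be filled with EED estimates, since exact input distances are unavailable.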
Experiments and results
To assess the performance of the proposed method, we conducted four experiments with both synthetic and real-world data. The first three experiments evaluate how well the EED method reconstructs Euclidean distances. In the last experiment, we evaluate the usage of the EED in the design of the MLM for incomplete data.
The EED method is compared to Conditional Mean Imputation (CMI, [24]) and the Expected Squared Distance (ESD, [1]). In CMI, each missing entry is filled with its expected value conditioned on the observed components.
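For a single multivariate Gaussian (the GMM case combines several of these, weighted by component responsibilities), the conditional mean used by the CMI baseline has the standard closed form $\mu_m + \Sigma_{mo}\Sigma_{oo}^{-1}(x_o - \mu_o)$, where $m$ and $o$ index the missing and observed blocks. A sketch of that single-Gaussian case:

```python
import numpy as np

def gaussian_conditional_mean(x, mu, Sigma):
    """Conditional mean imputation under a single multivariate Gaussian
    N(mu, Sigma): fill the missing block m of x with
        mu_m + Sigma_mo @ inv(Sigma_oo) @ (x_o - mu_o)."""
    x = np.asarray(x, float).copy()
    mu = np.asarray(mu, float)
    Sigma = np.asarray(Sigma, float)
    m = np.isnan(x)              # missing components
    o = ~m                       # observed components
    if not m.any():
        return x
    x[m] = mu[m] + Sigma[np.ix_(m, o)] @ np.linalg.solve(
        Sigma[np.ix_(o, o)], x[o] - mu[o])
    return x
```

With a diagonal covariance this degenerates to filling in the marginal mean; the off-diagonal terms are what let observed components inform the imputed values.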
Conclusion
In this paper, we presented an algorithm to estimate the pairwise Euclidean distance between vectors with missing data. In the proposed method, named Expected Euclidean Distance (EED), the distance between vectors is assumed to follow a Nakagami distribution whose parameters can be calculated from the first four moments of the dataset distribution. In the EED, a GMM is used to model the distribution of the dataset.
The accuracy of the EED in distance estimation is verified through experiments on synthetic and real-world datasets.
References (26)
- et al., Mixture of Gaussians for distance estimation with missing data, Neurocomputing (2014).
- et al., Distance estimation in numerical data sets with missing values, Inform. Sci. (2013).
- Locally linear reconstruction based missing value imputation for supervised learning, Neurocomputing (2013).
- et al., Multi-objective genetic algorithm for missing data imputation, Pattern Recog. Lett. (2015).
- et al., A neural network-based framework for the reconstruction of incomplete data sets, Neurocomputing (2010).
- et al., Nearest neighbor pattern classification, IEEE Trans. Inform. Theory (1967).
- M. Lichman, UCI Machine Learning Repository, ...
- et al., Techniques for dealing with incomplete data: a tutorial and survey, Pattern Anal. Appl. (2015).
- E. Acuña, C. Rodriguez, The Treatment of Missing Values and its Effect on Classifier Accuracy, Springer Berlin...
- et al., Extreme learning machine for missing data using multiple imputations, Neurocomputing (2016).
- Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics.
- Self-organization and Associative Memory, 3rd Edition.
- The m-distribution: a general formula of intensity distribution of rapid fading.
Diego Parente P. Mesquita holds a bachelor’s degree in Computer Science from Universidade Federal do Ceará (UFC, 2015) and is currently pursuing a master’s degree in Computer Science at the same university.
João Paulo Pordeus Gomes holds a bachelor’s degree in Electrical Engineering from Universidade Federal do Ceará (UFC, 2004), Brazil, a master’s degree (2006) in Aeronautical Engineering and a doctoral degree (2011) in Electronic Engineering from Instituto Tecnológico de Aeronáutica (ITA), São José dos Campos, SP, Brazil. Dr. Gomes worked for EMBRAER S.A. between 2006 and 2013 as a Technology Development Engineer, focusing on fault monitoring applications in aeronautical systems. He is currently an Assistant Professor at UFC.
Amauri Holanda de Souza Junior holds a bachelor’s degree in Telematics from the Federal Institute of Ceará (IFCE, 2007), Brazil, and master’s (2006) and doctoral (2014) degrees in Teleinformatics Engineering from Universidade Federal do Ceará (UFC), Fortaleza, CE, Brazil. He is currently an Assistant Professor at IFCE.
Juvêncio Santos Nobre holds a bachelor’s degree in Statistics from Universidade Federal do Ceará (UFC, 2002), Brazil, and master’s (2004) and doctoral (2017) degrees in Statistics from Universidade de São Paulo (USP), São Paulo, SP, Brazil. He is currently an Assistant Professor at UFC.