Learn++.MF: A random subspace approach for the missing feature problem
Introduction
The integrity and completeness of data are essential for any classification algorithm. After all, a trained classifier – unless specifically designed to address this issue – cannot process instances with missing features, as the missing value(s) in the input vectors would make the matrix operations involved in data processing impossible. To obtain a valid classification, the data to be classified should be complete, with no missing features (henceforth, we use missing data and missing features interchangeably). Missing data are not uncommon in real-world applications: bad sensors, failed pixels, unanswered questions in surveys, malfunctioning equipment, medical tests that cannot be administered under certain conditions, etc. are all familiar scenarios in practice that can result in missing features. Feature values that fall beyond the expected dynamic range of the data due to extreme noise, signal saturation, data corruption, etc. can also be treated as missing data. Furthermore, if the entire dataset is not acquired under identical conditions (time/location, using the same equipment, etc.), different data instances may be missing different features.
Fig. 1 illustrates such a scenario for a handwritten character recognition application: characters are digitized on an 8×8 grid, creating 64 features, f1–f64, a random subset (of about 20–30%) of which – indicated in orange (light shading) – are missing in each case. Having such a large proportion of randomly varying features may be viewed as an extreme and unlikely scenario, warranting reacquisition of the entire dataset. However, data reacquisition is often expensive, impractical, or sometimes even impossible, justifying the need for an alternative practical solution.
The classification algorithm described in this paper is designed to provide such a practical solution, accommodating missing features subject to the condition of distributed redundancy (discussed in Section 3), which is satisfied surprisingly often in practice.
The simplest approach for dealing with missing data is to ignore the instances with missing attributes. Commonly referred to as filtering or list-wise deletion, such techniques are clearly suboptimal when a large portion of the data have missing attributes [1], and are of course infeasible if each instance is missing one or more features. A more pragmatic approach commonly used to accommodate missing data is imputation [2], [3], [4], [5]: substitute the missing value with a meaningful estimate. Traditional examples of this approach include replacing the missing value with one of the existing data points (the most similar by some measure), as in hot-deck imputation [2], [6], [7], with the mean of that attribute across the data, or with the mean of its k-nearest neighbors [4]. In order for the estimate to be a faithful representative of the missing value, however, the training data must be sufficiently dense, a requirement rarely satisfied for datasets with even a modest number of features. Furthermore, imputation methods are known to produce biased estimates as the proportion of missing data increases. A related, but perhaps more rigorous, approach is to use polynomial regression to estimate substitutes for missing values [8]. However, regression techniques – besides being difficult to implement in high dimensions – assume that the data can reasonably be fit by a particular polynomial, which may not hold for many applications.
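For concreteness, the two simplest imputation strategies mentioned above, mean imputation and k-nearest-neighbor imputation, can be sketched as follows. This is an illustrative NumPy sketch in our own notation, not code from the paper; the function names are ours, and the kNN variant measures distance only over the mutually observed features of complete rows:

```python
import numpy as np

def mean_impute(X):
    """Replace each NaN with the observed mean of its column."""
    X = X.astype(float).copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = np.take(col_means, cols)
    return X

def knn_impute(X, k=2):
    """Replace NaNs with the mean of the k nearest complete rows,
    where distance is computed over the features observed in the
    incomplete row (assumes at least k complete rows exist)."""
    X = X.astype(float).copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i in range(X.shape[0]):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        d = np.linalg.norm(complete[:, ~miss] - X[i, ~miss], axis=1)
        nearest = complete[np.argsort(d)[:k]]
        X[i, miss] = nearest[:, miss].mean(axis=0)
    return X
```

Both methods share the weakness discussed above: with sparse training data, neither the column mean nor the neighbor average need be close to the true missing value.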
Theoretically rigorous methods with provable performance guarantees have also been developed. Many of these methods rely on model based estimation, such as Bayesian estimation [9], [10], [11], which calculates posterior probabilities by integrating over the missing portions of the feature space. Such an approach also requires a sufficiently dense data distribution; but more importantly, a prior distribution for all unknown parameters must be specified, which requires prior knowledge. Such knowledge is typically vague or non-existent, and inaccurate choices often lead to inferior performance.
An alternative strategy in model based methods is the Expectation Maximization (EM) algorithm [12], [13], [14], [10], justly admired for its theoretical elegance. EM consists of two steps, the expectation (E) and the maximization (M) step, which iteratively maximize the expectation of the log-likelihood of the complete data, conditioned on observed data. Conceptually, these steps are easy to construct, and the range of problems that can be handled by EM is quite broad. However, there are two potential drawbacks: (i) convergence can be slow if large portions of data are missing; and (ii) the M step may be quite difficult if a closed form of the distribution is unknown, or if different instances are missing different features. In such cases, the theoretical simplicity of EM does not translate into practice [2]. EM also requires prior knowledge that is often unavailable, or estimation of an unknown underlying distribution, which may be computationally prohibitive for large dimensional datasets. Incorrect distribution estimation often leads to inconsistent results, whereas lack of sufficiently dense data typically causes loss of accuracy. Several variations have been proposed, such as using Gaussian mixture models [15], [16]; or Expected Conditional Maximization, to mitigate some of these difficulties [2].
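To make the E and M steps concrete, consider the simplest instance of this family: EM estimation of the mean and covariance of a single multivariate Gaussian from data with MCAR missing entries. This is our own illustrative sketch, not the paper's method or any of the cited variants; the E-step fills each missing block with its conditional expectation given the observed block, accumulating the conditional covariance as a correction term, and the M-step re-estimates the parameters from the completed data:

```python
import numpy as np

def em_gaussian(X, n_iter=50):
    """EM for the mean/covariance of a multivariate Gaussian,
    with missing entries marked as NaN."""
    X = X.astype(float)
    n, d = X.shape
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0) + 1e-6)
    for _ in range(n_iter):
        Xhat = X.copy()
        C = np.zeros((d, d))
        for i in range(n):
            m = np.isnan(X[i])
            if not m.any():
                continue
            o = ~m
            soo_inv = np.linalg.inv(sigma[np.ix_(o, o)])
            smo = sigma[np.ix_(m, o)]
            # E-step: conditional expectation of the missing block
            Xhat[i, m] = mu[m] + smo @ soo_inv @ (X[i, o] - mu[o])
            # accumulate the conditional covariance correction
            C[np.ix_(m, m)] += sigma[np.ix_(m, m)] - smo @ soo_inv @ smo.T
        # M-step: re-estimate parameters from the completed data
        mu = Xhat.mean(axis=0)
        diff = Xhat - mu
        sigma = (diff.T @ diff + C) / n
    return mu, sigma
```

Even in this favorable closed-form case, each iteration touches every incomplete instance, which illustrates why convergence slows as the fraction of missing data grows; when no closed form exists, the M step becomes correspondingly harder, as noted above.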
Neural network based approaches have also been proposed. Gupta and Lam [17] looked at weight decay on datasets with missing values, whereas Yoon and Lee [18] proposed the Training-EStimation-Training (TEST) algorithm, which predicts the actual target value from a reasonably well estimated imputed one. Other approaches use neuro-fuzzy algorithms [19], where unknown values of the data are either estimated (or a classification is made using the existing features) by calculating the fuzzy membership of the data point to its nearest neighbors, clusters or hyperboxes. The parameters of the clusters and hyperboxes are determined from the existing data. Algorithms based on the general fuzzy min–max neural networks [20] or ARTMAP and fuzzy c-means clustering [21] are examples of this approach.
More recently, ensemble based approaches have also been proposed. For example, Melville et al. [22] showed that the algorithm DECORATE, which generates artificial data (with no missing values) from existing data (with missing values), is quite robust to missing data. On the other hand, Juszczak and Duin [23] proposed combining an ensemble of one-class classifiers, each trained on a single feature. This approach can handle any combination of missing features with the fewest classifiers possible. It can be very effective as long as single features can reasonably estimate the underlying decision boundaries, which is not always plausible.
In this contribution, we propose an alternative strategy, called Learn++.MF. It is inspired in part by our previously introduced ensemble-of-classifiers based incremental learning algorithm, Learn++, and in part by the random subspace method (RSM). In essence, Learn++.MF generates a sufficient number of classifiers, each trained on a random subset of features. An instance with one or more missing attributes is then classified by majority voting of those classifiers that did not use the missing features during their training. Hence, this approach differs in one fundamental aspect from the other techniques: Learn++.MF does not try to estimate the values of the missing data; instead it tries to extract the most discriminatory classification information provided by the existing data, by taking advantage of a presumed redundancy in the feature set. Hence, Learn++.MF avoids many of the pitfalls of estimation and imputation based techniques.
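The core mechanism described above can be sketched compactly. The following is an illustrative sketch only, not the authors' implementation: the paper's free parameters appear here as `n_classifiers` (ensemble size) and `nof` (features per classifier), and a nearest-centroid classifier stands in for whatever base classifier the ensemble actually uses. Missing features are marked as NaN, and only classifiers whose feature subsets avoid them vote:

```python
import numpy as np

class LearnMFSketch:
    """Sketch of the Learn++.MF idea: classifiers trained on random
    feature subsets; at test time, only the classifiers whose subsets
    contain no missing feature participate in the majority vote."""

    def __init__(self, n_classifiers=100, nof=3, seed=0):
        self.T, self.nof = n_classifiers, nof
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        d = X.shape[1]
        self.subsets_, self.centroids_ = [], []
        for _ in range(self.T):
            fs = self.rng.choice(d, self.nof, replace=False)
            # nearest-centroid base classifier on the feature subset fs
            cents = np.array([X[y == c][:, fs].mean(axis=0)
                              for c in self.classes_])
            self.subsets_.append(fs)
            self.centroids_.append(cents)
        return self

    def predict_one(self, x):
        """x may contain NaNs for missing features."""
        observed = ~np.isnan(x)
        votes = np.zeros(len(self.classes_))
        for fs, cents in zip(self.subsets_, self.centroids_):
            if observed[fs].all():          # classifier is usable
                d2 = ((cents - x[fs]) ** 2).sum(axis=1)
                votes[np.argmin(d2)] += 1
        if votes.sum() == 0:                # no usable classifier
            return None
        return self.classes_[np.argmax(votes)]
```

Note that nothing is estimated or imputed: an instance is either classifiable from its observed features (possibly by only a fraction of the ensemble) or, in the rare case that every classifier uses at least one missing feature, left undecided.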
As in most missing feature approaches, we assume that the probability of a feature being missing is independent of the value of that or any other feature in the dataset. This model is referred to as missing completely at random (MCAR) [2]. Hence, given the dataset X=(xij), where xij represents the jth feature of instance xi, and a missing data indicator matrix M=(Mij), where Mij is 1 if xij is missing and 0 otherwise, we assume that P(M|X)=P(M); that is, the missingness mechanism does not depend on the feature values, observed or missing. This is the most restrictive mechanism that generates missing data, the one that provides us with no additional information. However, this is also the only case where list-wise deletion or most imputation approaches lead to no bias.
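The MCAR model is straightforward to simulate, which is also how missing-data experiments of this kind are typically set up. A minimal sketch (our own, not from the paper): the mask M is drawn independently of X, so P(M|X)=P(M) holds by construction:

```python
import numpy as np

def mcar_mask(X, p_missing, seed=0):
    """Delete each entry of X independently with probability p_missing,
    regardless of its value (the MCAR mechanism). Returns the masked
    data (NaN for missing) and the indicator matrix M."""
    rng = np.random.default_rng(seed)
    M = rng.random(X.shape) < p_missing    # independent of X's values
    Xm = X.astype(float).copy()
    Xm[M] = np.nan
    return Xm, M
```

Under any mechanism where M depended on X (missing at random, or missing not at random), the deletion-is-unbiased property claimed above would no longer hold.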
In the rest of this paper, we first provide a review of ensemble systems, followed by the description of the Learn++.MF algorithm, and provide a theoretical analysis to guide the choice of its free parameters: the number of features used to train each classifier, and the ensemble size. We then present results, analyze algorithm performance with respect to its free parameters, and compare performance metrics with theoretically expected values. We also compare the performance of Learn++.MF to that of a single classifier using mean imputation, to naïve Bayes, which can naturally handle missing features, to an intermediate approach that combines RSM with mean imputation, and to the one-class ensemble approach of [23]. We conclude with discussions and remarks.
Background: ensemble and random subspace methods
An ensemble-based system is obtained by combining diverse classifiers. The diversity of the classifiers, typically achieved by using different training parameters for each classifier, allows individual classifiers to generate different decision boundaries [24], [25], [26], [27]. If proper diversity is achieved, each classifier makes a different error, and a strategic combination of the classifiers can then reduce the total error. Starting in the early 1990s, with such seminal works as [28], [29], [30], [13],
Assumptions and targeted applications of the proposed approach
The essence of the proposed approach is to generate a sufficiently large number of classifiers, each trained with a random subset of features. When an instance x with missing feature(s) needs to be classified, only those classifiers trained with the features that are presently available in x are used for the classification. As such, Learn++.MF follows an alternative paradigm for the solution of the missing feature problem: the algorithm tries to extract most of the information from the
Experimental setup and overview of results
Learn++.MF was evaluated on several databases with various numbers of features. The algorithm showed similar trends in all cases, which we discuss in detail below. Due to the amount of detail involved with each dataset, as well as for brevity and to avoid repetition of similar outcomes, we present representative full results on four datasets here, and provide results for many additional real-world and UCI benchmark datasets online [48], a summary of which is provided at the end of this
Discussions and conclusions
The RSM in general, and the Learn++.MF algorithm as a specific implementation of RSM in particular, are presented as alternative solutions to the missing feature problem. Learn++.MF creates an ensemble of classifiers, each trained with a random subset of the features, so that an instance with missing features can still be classified using classifiers whose training data did not include those attributes.
There are two important parameters of the algorithm: nof, the number of features used to train
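The interaction between these two parameters can be illustrated with elementary combinatorics (our notation, not necessarily the paper's exact derivation): with f total features, nof features per classifier, and m features missing in a test instance, the probability that one randomly drawn subset avoids all missing features is C(f−m, nof)/C(f, nof), and the probability that at least one of T independently drawn classifiers is usable follows directly:

```python
from math import comb

def p_usable(f, nof, m):
    """Probability that a classifier trained on nof of f features,
    chosen uniformly at random, avoids all m missing features."""
    if f - m < nof:
        return 0.0
    return comb(f - m, nof) / comb(f, nof)

def p_at_least_one(f, nof, m, T):
    """Probability that at least one of T independently drawn
    classifiers is usable for an instance missing m features."""
    return 1.0 - (1.0 - p_usable(f, nof, m)) ** T
```

The trade-off is visible immediately: a smaller nof makes each classifier more likely to survive missing features (and allows a larger m before p_usable drops to zero), but gives each classifier less information to discriminate with, while a larger ensemble size T compensates by raising the chance that some usable classifiers remain.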
Acknowledgements
This material is based upon the work supported by the National Science Foundation under Grant nos. ECCS-0239090 and ECCS 0926159. The authors acknowledge S. Krause for his contributions during the earliest stages of this work. The authors also wish to acknowledge the anonymous reviewers, as this paper has benefited greatly from addressing their comments and suggestions.
References (48)
- Impact of imputation of missing values on classification error for discrete data, Pattern Recognition (2008)
- Empirical likelihood confidence intervals for differences between two datasets with missing data, Pattern Recognition Letters (2008)
- Imputation through finite Gaussian mixture models, Computational Statistics and Data Analysis (2007)
- Using diversity of errors for selecting members of a committee classifier, Pattern Recognition (2006)
- Ensemble diversity measures and their application to thinning, Information Fusion (2005)
- Diversity creation methods: a survey and categorisation, Information Fusion (2005)
- Moderate diversity for better cluster ensembles, Information Fusion (2006)
- Stacked generalization, Neural Networks (1992)
- Decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences (1997)
- Diversity in search strategies for ensemble feature selection, Information Fusion (2005)
- Statistical Analysis with Missing Data
- A reappraisal of distance-weighted K-nearest neighbor classification for pattern recognition with missing data, IEEE Transactions on Systems, Man and Cybernetics
- The treatment of missing data, in The SAGE Handbook of Social Science Methodology
- Consistent regression methods for discriminant analysis with incomplete data, Journal of the American Statistical Association
- Some solutions to the missing feature problem in data classification, with application to noise robust ASR, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 98)
- Efficient methods for dealing with missing data in supervised learning, Neural Information Processing Systems
- Robust learning with missing data, Machine Learning
- Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society
- Hierarchical mixtures of experts and the EM algorithm, Neural Computation
- The EM Algorithm and Extensions
- On classification with incomplete data, IEEE Transactions on Pattern Analysis and Machine Intelligence
- The weight decay backpropagation for generalizations with missing values, Annals of Operations Research
Dr. Robi Polikar is an Associate Professor of Electrical and Computer Engineering at Rowan University, in Glassboro, New Jersey where he leads the Signal Processing and Pattern Recognition Laboratory. He has received his co-major Ph.D. degree in Electrical Engineering and Biomedical Engineering, from Iowa State University, Ames, Iowa in 2000. His current research interests within computational intelligence include ensemble systems, incremental and nonstationary learning, and various applications of pattern recognition in biomedical engineering, such as brain-computer interface and early diagnosis of Alzheimer's disease based on EEG analysis. He is a senior member of IEEE, and member of ASEE, Tau Beta Pi and Eta Kappa Nu. His current and recent work has been funded primarily by NSF and NIH.
Joseph DePasquale is a Master of Science student at Rowan University. His areas of interest include applications of ensemble systems to various areas including bioinformatics.
Hussein Syed-Mohamed has recently completed his Master of Science degree at Rowan University and is currently a System Engineer at Honeywell.
Dr. Gavin Brown is a Lecturer at the University of Manchester, UK, in the Machine Learning and Optimization Group. He received his Ph.D. from the University of Birmingham in 2004, on the topic of neural network ensembles, for which he also received the Distinguished Dissertation Award from the British Computer Society. His current areas of interest are feature selection and extraction with information theoretic methods, Markov Blanket algorithms, and online learning with particular application to systems biology and adaptive compiler optimization.
Dr. Ludmilla Kuncheva received her M.Sc. degree from the Technical University of Sofia, Bulgaria, in 1982, and her Ph.D. degree from the Bulgarian Academy of Sciences in 1987. Until 1997 she worked at the Central Laboratory of Biomedical Engineering at the Bulgarian Academy of Sciences. Dr. Kuncheva is currently a Reader at the School of Computer Science, Bangor University, UK. Her interests include pattern recognition and classification, machine learning, classifier combination and fMRI data analysis. She has published two books and above 150 scientific papers.