Elsevier

Pattern Recognition

Volume 43, Issue 11, November 2010, Pages 3817-3832

Learn++.MF: A random subspace approach for the missing feature problem

https://doi.org/10.1016/j.patcog.2010.05.028

Abstract

We introduce Learn++.MF, an ensemble-of-classifiers based algorithm that employs random subspace selection to address the missing feature problem in supervised classification. Unlike most established approaches, Learn++.MF does not replace missing values with estimated ones, and hence does not need specific assumptions on the underlying data distribution. Instead, it trains an ensemble of classifiers, each on a random subset of the available features. Instances with missing values are classified by the majority voting of those classifiers whose training data did not include the missing features. We show that Learn++.MF can accommodate a substantial amount of missing data, with only a gradual decline in performance as the amount of missing data increases. We also analyze the effect of the cardinality of the random feature subsets and of the ensemble size on algorithm performance. Finally, we discuss the conditions under which the proposed approach is most effective.

Introduction

The integrity and completeness of data are essential for any classification algorithm. After all, a trained classifier – unless specifically designed to address this issue – cannot process instances with missing features, as the missing number(s) in the input vectors would make the matrix operations involved in data processing impossible. To obtain a valid classification, the data to be classified should be complete with no missing features (henceforth, we use missing data and missing features interchangeably). Missing data are not uncommon in real-world applications: bad sensors, failed pixels, unanswered questions in surveys, malfunctioning equipment, medical tests that cannot be administered under certain conditions, etc. are all familiar scenarios in practice that can result in missing features. Feature values that are beyond the expected dynamic range of the data due to extreme noise, signal saturation, data corruption, etc. can also be treated as missing data. Furthermore, if the entire data are not acquired under identical conditions (time/location, using the same equipment, etc.), different data instances may be missing different features.

Fig. 1 illustrates such a scenario for a handwritten character recognition application: characters are digitized on an 8×8 grid, creating 64 features, f1–f64, a random subset (of about 20–30%) of which – indicated in orange (light shading) – are missing in each case. Having such a large proportion of randomly varying features may be viewed as an extreme and unlikely scenario, warranting reacquisition of the entire dataset. However, data reacquisition is often expensive, impractical, or sometimes even impossible, justifying the need for an alternative practical solution.

The classification algorithm described in this paper is designed to provide such a practical solution, accommodating missing features subject to the condition of distributed redundancy (discussed in Section 3), which is satisfied surprisingly often in practice.

The simplest approach for dealing with missing data is to ignore those instances with missing attributes. Commonly referred to as filtering or listwise deletion approaches, such techniques are clearly suboptimal when a large portion of the data have missing attributes [1], and are of course infeasible if each instance is missing at least one feature. A more pragmatic approach commonly used to accommodate missing data is imputation [2], [3], [4], [5]: substitute the missing value with a meaningful estimate. Traditional examples of this approach include replacing the missing value with one of the existing data points (most similar in some measure) as in hot-deck imputation [2], [6], [7], the mean of that attribute across the data, or the mean of its k-nearest neighbors [4]. In order for the estimate to be a faithful representative of the missing value, however, the training data must be sufficiently dense, a requirement rarely satisfied for datasets with even a modest number of features. Furthermore, imputation methods are known to produce biased estimates as the proportion of missing data increases. A related, but perhaps more rigorous, approach is to use polynomial regression to estimate substitutes for missing values [8]. However, regression techniques – besides being difficult to implement in high dimensions – assume that the data can reasonably be fit by a particular polynomial, which may not hold for many applications.
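To make the imputation baseline concrete, the following minimal sketch shows the two simple strategies mentioned above, mean imputation and k-nearest-neighbor imputation. It assumes scikit-learn and NumPy are available; the toy array and parameter values are purely illustrative and are not taken from the paper.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy data matrix; NaN marks a missing feature value.
X = np.array([[5.1, 3.5, np.nan],
              [4.9, np.nan, 1.4],
              [6.3, 3.3, 6.0],
              [5.8, 2.7, 5.1]])

# Mean imputation: replace each missing value with the mean of its column.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# k-NN imputation: replace each missing value with the average of that
# feature over the k most similar instances (similarity is computed on the
# features both instances have observed).
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```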

Theoretically rigorous methods with provable performance guarantees have also been developed. Many of these methods rely on model based estimation, such as Bayesian estimation [9], [10], [11], which calculates posterior probabilities by integrating over the missing portions of the feature space. Such an approach also requires a sufficiently dense data distribution; but more importantly, a prior distribution for all unknown parameters must be specified, which requires prior knowledge. Such knowledge is typically vague or non-existent, and inaccurate choices often lead to inferior performance.

An alternative strategy in model based methods is the Expectation Maximization (EM) algorithm [12], [13], [14], [10], justly admired for its theoretical elegance. EM consists of two steps, the expectation (E) and the maximization (M) step, which iteratively maximize the expectation of the log-likelihood of the complete data, conditioned on the observed data. Conceptually, these steps are easy to construct, and the range of problems that can be handled by EM is quite broad. However, there are two potential drawbacks: (i) convergence can be slow if large portions of data are missing; and (ii) the M step may be quite difficult if a closed form of the distribution is unknown, or if different instances are missing different features. In such cases, the theoretical simplicity of EM does not translate into practice [2]. EM also requires prior knowledge that is often unavailable, or estimation of an unknown underlying distribution, which may be computationally prohibitive for large dimensional datasets. Incorrect distribution estimation often leads to inconsistent results, whereas lack of sufficiently dense data typically causes loss of accuracy. Several variations have been proposed to mitigate some of these difficulties, such as using Gaussian mixture models [15], [16] or Expected Conditional Maximization [2].
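As an illustration of what the E and M steps look like in the missing-data setting, the sketch below estimates the mean and covariance of a single multivariate Gaussian from incomplete data. This is a textbook special case written here for illustration (NaN encodes a missing entry; the function and variable names are our own), not the general algorithm of [12].

```python
import numpy as np

def em_gaussian_missing(X, n_iter=50):
    """EM estimate of the mean and covariance of a multivariate Gaussian
    from an (n x d) data matrix X whose missing entries are NaN."""
    X_filled = X.astype(float).copy()
    n, d = X_filled.shape
    miss = np.isnan(X_filled)

    # Initialization: column means for mu, covariance of mean-imputed data.
    mu = np.nanmean(X_filled, axis=0)
    X_filled = np.where(miss, mu, X_filled)
    Sigma = np.cov(X_filled, rowvar=False) + 1e-6 * np.eye(d)

    for _ in range(n_iter):
        corr = np.zeros((d, d))        # accumulated conditional covariances
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():            # row with every feature missing
                X_filled[i] = mu
                corr += Sigma
                continue
            Soo = Sigma[np.ix_(o, o)]
            Smo = Sigma[np.ix_(m, o)]
            Smm = Sigma[np.ix_(m, m)]
            K = Smo @ np.linalg.inv(Soo)
            # E-step: conditional mean of the missing block given the observed one.
            X_filled[i, m] = mu[m] + K @ (X_filled[i, o] - mu[o])
            # The conditional covariance feeds into the covariance update below.
            corr[np.ix_(m, m)] += Smm - K @ Smo.T
        # M-step: re-estimate the parameters from the completed data.
        mu = X_filled.mean(axis=0)
        diff = X_filled - mu
        Sigma = (diff.T @ diff + corr) / n
    return mu, Sigma
```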

Neural network based approaches have also been proposed. Gupta and Lam [17] looked at weight decay on datasets with missing values, whereas Yoon and Lee [18] proposed the Training-EStimation-Training (TEST) algorithm, which predicts the actual target value from a reasonably well estimated imputed one. Other approaches use neuro-fuzzy algorithms [19], where the unknown values are either estimated, or a classification is made directly from the existing features, by calculating the fuzzy membership of the data point in its nearest neighbors, clusters, or hyperboxes. The parameters of the clusters and hyperboxes are determined from the existing data. Algorithms based on the general fuzzy min–max neural networks [20] or ARTMAP and fuzzy c-means clustering [21] are examples of this approach.

More recently, ensemble based approaches have also been proposed. For example, Melville et al. [22] showed that the algorithm DECORATE, which generates artificial data (with no missing values) from existing data (with missing values), is quite robust to missing data. On the other hand, Juszczak and Duin [23] proposed combining an ensemble of one-class classifiers, each trained on a single feature. This approach is capable of handling any combination of missing features with the fewest possible number of classifiers. The approach can be very effective as long as single features can reasonably estimate the underlying decision boundaries, which is not always plausible.

In this contribution, we propose an alternative strategy, called Learn++.MF. It is inspired in part by our previously introduced ensemble-of-classifiers based incremental learning algorithm, Learn++, and in part by the random subspace method (RSM). In essence, Learn++.MF generates a sufficient number of classifiers, each trained on a random subset of features. An instance with one or more missing attributes is then classified by majority voting of those classifiers that did not use the missing features during their training. Hence, this approach differs in one fundamental aspect from the other techniques: Learn++.MF does not try to estimate the values of the missing data; instead it tries to extract the most discriminatory classification information provided by the existing data, by taking advantage of a presumed redundancy in the feature set. Hence, Learn++.MF avoids many of the pitfalls of estimation and imputation based techniques.
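The central mechanism described above can be sketched in a few lines of Python. The sketch below is a simplified illustration of the random-subspace training and missing-feature-aware voting only; it omits the remaining details of Learn++.MF (given in Section 3), and the class name, base learner, and parameter values are assumptions made for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

class RandomSubspaceVotingSketch:
    """Train T classifiers, each on `nof` randomly chosen features; classify an
    instance with missing values (NaN) by majority vote of the classifiers
    whose feature subsets contain none of the missing features."""

    def __init__(self, base=None, n_classifiers=200, nof=3, seed=0):
        self.base = base if base is not None else DecisionTreeClassifier()
        self.T, self.nof = n_classifiers, nof
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        n_features = X.shape[1]
        self.members_ = []
        for _ in range(self.T):
            fs = self.rng.choice(n_features, size=self.nof, replace=False)
            self.members_.append((fs, clone(self.base).fit(X[:, fs], y)))
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        preds = []
        for x in X:
            missing = np.isnan(x)
            votes = {}
            for fs, clf in self.members_:
                if missing[fs].any():      # this classifier needs a missing feature
                    continue
                label = clf.predict(x[fs].reshape(1, -1))[0]
                votes[label] = votes.get(label, 0) + 1
            # If no classifier is usable, fall back to an arbitrary class label.
            preds.append(max(votes, key=votes.get) if votes else self.classes_[0])
        return np.array(preds)
```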

As in most missing feature approaches, we assume that the probability of a feature being missing is independent of the value of that or any other feature in the dataset. This model is referred to as missing completely at random (MCAR) [2]. Hence, given the dataset X=(xij) where xij represents the jth feature of instance xi, and a missing data indicator matrix M=(Mij), where Mij is 1 if xij is missing and 0 otherwise, we assume that p(M|X)=p(M). This is the most restrictive mechanism that generates missing data, the one that provides us with no additional information. However, this is also the only case where list-wise deletion or most imputation approaches lead to no bias.
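The MCAR assumption is also easy to simulate when testing such algorithms, since each entry is removed independently of the data values. A minimal sketch follows (the function name and probability parameter are ours, chosen for illustration).

```python
import numpy as np

def make_mcar(X, p_missing, seed=0):
    """Return a copy of X with entries removed completely at random, together
    with the indicator matrix M (M_ij = 1 iff x_ij is missing). Under MCAR the
    mask depends only on p_missing, never on the values in X."""
    rng = np.random.default_rng(seed)
    M = rng.random(X.shape) < p_missing
    X_missing = X.astype(float).copy()
    X_missing[M] = np.nan
    return X_missing, M.astype(int)
```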

In the rest of this paper, we first provide a review of ensemble systems, followed by the description of the Learn++.MF algorithm, and provide a theoretical analysis to guide the choice of its free parameters: the number of features used to train each classifier, and the ensemble size. We then present results, analyze algorithm performance with respect to its free parameters, and compare performance metrics with theoretically expected values. We also compare the performance of Learn++.MF to that of a single classifier using mean imputation, to a naïve Bayes classifier that can naturally handle missing features, to an intermediate approach that combines RSM with mean imputation, as well as to the one-class ensemble approach of [23]. We conclude with discussions and remarks.

Section snippets

Background: ensemble and random subspace methods

An ensemble-based system is obtained by combining diverse classifiers. The diversity in the classifiers, typically achieved by using different training parameters for each classifier, allows individual classifiers to generate different decision boundaries [24], [25], [26], [27]. If proper diversity is achieved, a different error is made by each classifier, strategic combination of which can then reduce the total error. Starting in the early 1990s, with such seminal works as [28], [29], [30], [13],

Assumptions and targeted applications of the proposed approach

The essence of the proposed approach is to generate a sufficiently large number of classifiers, each trained with a random subset of features. When an instance x with missing feature(s) needs to be classified, only those classifiers trained with the features that are presently available in x are used for the classification. As such, Learn++.MF follows an alternative paradigm for the solution of the missing feature problem: the algorithm tries to extract most of the information from the available features.

Experimental setup and overview of results

Learn++.MF was evaluated on several databases with various numbers of features. The algorithm resulted in similar trends in all cases, which we discuss in detail below. Due to the amount of detail involved with each dataset, as well as for brevity and to avoid repetition of similar outcomes, we present representative full results on four datasets here, and provide results for many additional real world and UCI benchmark datasets online [48], a summary of which is provided at the end of this

Discussions and conclusions

The RSM in general, and Learn++.MF algorithm as a specific implementation of RSM in particular, are presented as alternative solutions to the missing feature problem. Learn++.MF creates an ensemble of classifiers, each trained with a random subset of the features, so that an instance with missing features can still be classified using classifiers whose training data did not include those attributes.

There are two important parameters of the algorithm: nof, the number of features used to train each classifier, and the ensemble size.
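As a rough illustration of how these two parameters interact, consider a back-of-the-envelope calculation under the simplifying assumption that each classifier draws its nof features uniformly at random, independently of the others (this is not the paper's exact analysis): the probability that an instance with m of its n features missing can still be voted on by at least one classifier can be computed as follows.

```python
from math import comb

def p_at_least_one_usable(n_features, n_missing, nof, n_classifiers):
    """Probability that at least one of `n_classifiers` classifiers, each
    trained on `nof` features drawn uniformly at random, uses none of the
    `n_missing` missing features of a given instance (illustrative only)."""
    # A single classifier is usable iff all nof of its features are observed.
    p_single = comb(n_features - n_missing, nof) / comb(n_features, nof)
    return 1.0 - (1.0 - p_single) ** n_classifiers

# Example in the spirit of Fig. 1: 64 features, 13 (about 20%) missing.
print(p_at_least_one_usable(64, 13, nof=8, n_classifiers=1000))
```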

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant nos. ECCS-0239090 and ECCS-0926159. The authors acknowledge S. Krause for his contributions during the earliest stages of this work. The authors also wish to acknowledge the anonymous reviewers, as this paper has benefited greatly from addressing their comments and suggestions.

References (48)

  • H. Schöner, Working with real-world datasets: preprocessing and prediction with large incomplete and heterogeneous...
  • R.J.A. Little et al.

    Statistical Analysis with Missing Data

    (2002)
  • K.L. Wagstaff, V.G. Laidler, Making the most of missing values: object clustering with partial data in astronomy,...
  • R.L. Morin et al.

    A reappraisal of distance-weighted K-nearest neighbor classification for pattern recognition with missing data

    IEEE Transactions on Systems, Man and Cybernetics

    (1981)
  • D. Howell

    The treatment of missing data, in The SAGE Handbook of Social Science Methodology

  • R.J.A. Little

    Consistent regression methods for discriminant analysis with incomplete data

    Journal of the American Statistical Association

    (1978)
  • A.C. Morris et al.

    Some solutions to the missing feature problem in data classification, with application to noise robust ASR

    IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 98)

    (1998)
  • V. Tresp et al.

    Efficient methods for dealing with missing data in supervised learning

    Neural Information Processing Systems

    (1995)
  • M. Ramoni et al.

    Robust learning with missing data

    Machine Learning

    (2001)
  • A.P. Dempster et al.

    Maximum likelihood from incomplete data via the EM algorithm

    Journal of the Royal Statistical Society

    (1977)
  • M.I. Jordan et al.

    Hierarchical mixtures of experts and the EM algorithm

    Neural Computation

    (1994)
  • G.J. McLachlan et al.

    The EM Algorithm and Extensions

    (1992)
  • D. Williams et al.

    On classification with incomplete data

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2007)
  • A. Gupta et al.

    The weight decay backpropagation for generalizations with missing values

    Annals of Operations Research

    (1998)

Dr. Robi Polikar is an Associate Professor of Electrical and Computer Engineering at Rowan University, in Glassboro, New Jersey, where he leads the Signal Processing and Pattern Recognition Laboratory. He received his co-major Ph.D. degree in Electrical Engineering and Biomedical Engineering from Iowa State University, Ames, Iowa, in 2000. His current research interests within computational intelligence include ensemble systems, incremental and nonstationary learning, and various applications of pattern recognition in biomedical engineering, such as brain-computer interface and early diagnosis of Alzheimer's disease based on EEG analysis. He is a senior member of IEEE, and a member of ASEE, Tau Beta Pi and Eta Kappa Nu. His current and recent work has been funded primarily by NSF and NIH.

Joseph DePasquale is a Master of Science student at Rowan University. His areas of interest include applications of ensemble systems to various areas, including bioinformatics.

Hussein Syed-Mohamed has recently completed his Master of Science degree at Rowan University and is currently a System Engineer at Honeywell.

Dr. Gavin Brown is a Lecturer at the University of Manchester, UK, in the Machine Learning and Optimization Group. He received his Ph.D. from the University of Birmingham in 2004, on the topic of neural network ensembles, for which he also received the Distinguished Dissertation Award from the British Computer Society. His current areas of interest are feature selection and extraction with information theoretic methods, Markov Blanket algorithms, and online learning with particular application to systems biology and adaptive compiler optimization.

Dr. Ludmilla Kuncheva received her M.Sc. degree from the Technical University of Sofia, Bulgaria, in 1982, and her Ph.D. degree from the Bulgarian Academy of Sciences in 1987. Until 1997 she worked at the Central Laboratory of Biomedical Engineering at the Bulgarian Academy of Sciences. Dr. Kuncheva is currently a Reader at the School of Computer Science, Bangor University, UK. Her interests include pattern recognition and classification, machine learning, classifier combination and fMRI data analysis. She has published two books and over 150 scientific papers.
