
Pattern Recognition Letters

Volume 27, Issue 12, September 2006, Pages 1383-1389

Data complexity assessment in undersampled classification of high-dimensional biomedical data

https://doi.org/10.1016/j.patrec.2006.01.006

Abstract

Regularized linear classifiers have been successfully applied in undersampled (i.e. small sample size/high dimensionality) biomedical classification problems. In addition, data complexity measures have been proposed to assess the competence of a classifier in a particular context. Our work was motivated by the analysis of ill-posed regression problems by Elden and by the interpretation of linear discriminant analysis as a mean square error classifier. Using singular value decomposition analysis, we define a discriminatory power spectrum and show that it provides a useful means of data complexity assessment for undersampled classification problems.

Using five real-life biomedical data sets of increasing difficulty, we demonstrate how the data complexity of a classification problem can be related to the performance of regularized linear classifiers. We show that the concentration of the discriminatory power, as manifested in the discriminatory power spectrum, is a deciding factor for the success of regularized linear classifiers in undersampled classification problems. As a practical outcome of our work, the proposed data complexity assessment may facilitate the choice of a classifier for a given undersampled problem.

Introduction

Regularized linear discriminants have been successfully applied in undersampled (i.e. small sample size combined with high dimensionality) biomedical classification problems, e.g. for gene microarrays or biomedical spectra. They are often competitive with other (nonlinear) state-of-the-art algorithms (Tibshirani et al., 2002). In the small sample size–high dimensionality scenario, overfitting is a major issue (Somorjai et al., 2003a, Simon et al., 2004, Ambroise and McLachlan, 2002). Because of their limited capacity, linear classifiers have been advocated for undersampled biomedical problems (Simon et al., 2004). There are numerous successful applications of regularized linear classifiers in undersampled and, in particular, biomedical problems; these include partial least squares (PLS) (Nguyen and Rocke, 2002), support vector machines (SVM) (Guyon et al., 2002), pseudoinverse linear discriminant analysis (LDA) (Ye et al., 2004), and shrunken centroids (Tibshirani et al., 2002), among others.

Data complexity measures for classification have been proposed by Ho and Basu (2002) in order to discern the behavior of the classification algorithms in a given context.

Motivated by Elden's (2004) analysis of linear regression problems, we developed an approach for estimating data complexity in small sample size/high dimensionality classification problems, using singular value decomposition (SVD) analysis and the minimum square error (MSE) formulation of linear discriminant analysis. Using five real-life biomedical datasets of increasing difficulty, we show how the data complexity measures of a given classification problem can be related to the performance of linear classifiers.
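For background, the minimum square error (MSE) formulation of LDA invoked above is the standard least-squares view of the two-class problem (a textbook result, not reproduced from the paper): with n_1 and n_2 training samples in the two classes and n = n_1 + n_2,

\min_{w}\; \lVert X w - y \rVert_2^2,
\qquad
y_i =
\begin{cases}
  +\,n/n_1, & \text{sample } i \text{ in class } 1,\\
  -\,n/n_2, & \text{sample } i \text{ in class } 2,
\end{cases}

whose minimizing weight vector (with an appropriately chosen bias/threshold) is proportional to the Fisher discriminant direction S_W^{-1}(m_1 - m_2), where S_W is the pooled within-class scatter matrix and m_1, m_2 are the class mean vectors. This is the sense in which a least-squares, and hence SVD-based, analysis of the regression problem carries over to LDA.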

Section snippets

Linear regression and the singular value decomposition

Let X be a centered data matrix of dimensions n × pdim. Consider the singular value decomposition (SVD) of X:

X = USV^T = TV^T,

where U is an n × n matrix with the left singular vectors as columns, V is a pdim × pdim matrix with the right singular vectors as columns, S is an n × pdim matrix containing the singular values, and T = US is an n × pdim matrix.

The columns of U (or T), corresponding to the nonzero singular values of X, form a new set of eigenfeatures, which are usually referred to as principal components (Elden, 2004, Phatak and De Jong, 1997,
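As a minimal illustrative sketch of this SVD view (assuming NumPy; the synthetic data, variable names, and the final "discriminatory power" line are illustrative assumptions prompted by the abstract, not definitions taken from the paper):

import numpy as np

# Undersampled two-class problem: n samples, pdim features, n << pdim.
rng = np.random.default_rng(0)
n, pdim = 30, 500
X = rng.normal(size=(n, pdim))
y = np.repeat([1.0, -1.0], n // 2)      # +/-1 class-membership vector

Xc = X - X.mean(axis=0)                 # center the data matrix

# Thin SVD: Xc = U S V^T; with n < pdim at most n singular values are nonzero.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenfeatures (principal component scores): T = U S.
T = U * s

# One plausible per-component indicator of class-discriminatory content
# (an assumption, not the paper's exact spectrum): the magnitude of the
# projection of the centered class-membership vector onto each left
# singular vector.
power = np.abs(U.T @ (y - y.mean()))
print(power[:10])

Because at most n singular values are nonzero, the training data span at most n such directions, which is why a per-component view of this kind is natural in the undersampled setting.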

Data

The first four datasets we used comprised samples of magnetic resonance (MR) spectra. An MR spectrum is a collection of intensity peaks and valleys that carry discriminatory information due to the physical/chemical basis of the class separation (Lean et al., 2002). A typical MR spectrum (taken from dataset 2) is shown in Fig. 1a. Datasets 1–3 derive from the fast identification of pathogenic yeasts, using MR spectra (Himmelreich et al., 2003). Dataset 4 comprises MR spectra of biofluids

Classifiers and evaluation

We used two representatives of regularized linear classifiers in our experiments. As a realization of the regularized MSE LDA, we used PLS. We employed the NIPALS algorithm (Wold et al., 1984), as implemented in the Matlab toolbox PLSTOOLS (Eigenvector), to calculate the linear discriminant solution. We chose PLS because it has been successfully applied in undersampled biomedical problems (Bennett and Embrechts, 2003) and also because of its relationship to the SVD analysis in regression (Elden,
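The original experiments used the NIPALS-based PLS from the Matlab PLSTOOLS toolbox together with an SVM; purely as a rough, non-equivalent analogue, the following sketch compares a PLS-based discriminant with a linear SVM on a synthetic undersampled problem using scikit-learn. The dataset, the number of PLS components, the value of C, and the leave-one-out protocol are all illustrative assumptions, not the setup of the original study.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

# Synthetic undersampled problem: 40 samples, 1000 features.
X, y = make_classification(n_samples=40, n_features=1000, n_informative=20,
                           n_redundant=0, random_state=0)

# Regularized linear classifier 1: linear SVM.
svm = SVC(kernel="linear", C=1.0)
svm_err = 1.0 - cross_val_score(svm, X, y, cv=LeaveOneOut()).mean()

# Regularized linear classifier 2: PLS used as a discriminant --
# regress the +/-1 coded class labels and threshold the prediction at 0.
correct = []
for train, test in LeaveOneOut().split(X):
    pls = PLSRegression(n_components=3)   # number of components: an assumption
    pls.fit(X[train], 2.0 * y[train] - 1.0)
    pred = pls.predict(X[test]).ravel()[0]
    correct.append((pred > 0.0) == (y[test][0] == 1))
pls_err = 1.0 - float(np.mean(correct))

print(f"Leave-one-out error  SVM: {svm_err:.3f}  PLS: {pls_err:.3f}")

Leave-one-out estimation is used here only because the sample size is small; the evaluation protocol of the original study may differ.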

Results

The classification errors for the SVM and PLS classifiers are given in Table 2. The PLS and SVM classifiers show comparable performance on all five datasets. The lowest classification error was achieved on Dataset 5 and the highest on Dataset 4, whereas moderate classification errors were found for Datasets 1, 2 and 3. Therefore, according to the classification error obtained by the two classifiers under consideration, Dataset 5 is the “easiest” and Dataset 4 the “most difficult”.

The means of

Discussion

The PLS classifier is competitive with the SVM on all datasets. This performance of PLS is in agreement with similar results obtained, e.g., by Bennett and Embrechts (2003) and Elden (2004). However, the main goal of our study was not the quest for the best classifier, but a better understanding of classifier behavior in specific real-life settings.

For all five datasets, the concentration of the discriminatory information is quite high, as manifested by dominating discriminatory peaks in the

Acknowledgements

We thank Drs. T. Sorrell and U. Himmelreich for providing us with some of the datasets used in our experiments. We also thank an anonymous reviewer for his comments, which greatly improved the presentation.

References (24)

  • U. Himmelreich et al., Rapid identification of Candida species by using nuclear magnetic resonance spectroscopy and a statistical classification strategy, Appl. Environ. Microbiol. (2003)
  • T.K. Ho et al., Complexity measures of supervised classification problems, IEEE Trans. Pattern Anal. Machine Intell. (2002)
Cited by (36)

    • Analysis of data complexity measures for classification

      2013, Expert Systems with Applications
      Citation Excerpt:

      In Li, Dong, and Kothari (2005), Li et al. analyze some omnivariate decision trees using the measure of complexity based on data density proposed by Ho and Basu. Baumgartner and Somorjai (2006) define specific measures for regularized linear classifiers, using the Ho and Basu measures as reference. Mollineda et al. (2005) extend some Ho and Basu’s measure definitions for problems with two or more classes.

    • Predicting noise filtering efficacy with data complexity measures for nearest neighbor classification

      2013, Pattern Recognition
      Citation Excerpt:

      From these works, different authors attempt to address different data mining problems using these measures. For example, Baumgartner and Somorjai [18] define specialized measures for regularized linear classifiers. Other authors try to explain the behavior of learning algorithms using these measures, optimizing the decision tree creation in the binarization of datasets [19] or to analyze fuzzy-UCS and the model obtained when applied to data streams [20].

    • Linear separability and classification complexity

      2012, Expert Systems with Applications
      Citation Excerpt:

      The same authors in a later work analyse how the performance of the k-nearest neighbour classifier can be affected by some sources of complexity such as class overlapping, feature space dimensionality and class density through five complexity measures, two of them being introduced in the paper. Baumgartner and Somorjai (2006) show how to measure the data complexity of a classification problem through the performance of regularized linear classifiers in a context of small sample size/high dimensionality. A more recent work by Luengo and Herrera (2010) studies the performance of fuzzy rules based classification system with relation to complexity using eight of the metrics defined by Ho and Basu (2002).

    • Shared domains of competence of approximate learning models using measures of separability of classes

      2012, Information Sciences
      Citation Excerpt:

      Li et al. [21] analyze some omnivariate decision trees using the measure of complexity based on data density proposed by Ho and Basu. Baumgartner and Somorjai define specific measures for regularized linear classifiers [4], using Ho and Basu’s measures as reference. Sánchez et al. analyze the effect of data complexity on the nearest neighbors classifier [32].
