Bagging schemes on the presence of class noise in classification

https://doi.org/10.1016/j.eswa.2012.01.013

Abstract

In this paper, we study an application of Bagging credal decision trees, i.e. decision trees built using imprecise probabilities and uncertainty measures, to data sets with class noise (data sets with wrong assignments of the class label). To this end, we first extend the original method for building credal decision trees to one that works with continuous features and missing data. Through an experimental study, we show that Bagging credal decision trees outperforms more complex Bagging approaches on data sets with class noise. Finally, using a bias–variance decomposition of the error, we justify the performance of the Bagging credal decision tree method, showing that it achieves a stronger reduction of the variance component of the error.

Introduction

In the general area of data mining, the supervised classification problem is normally defined as follows: we have a data set of observations, called the training set, and we wish to obtain a set of rules in order to assign a value of the variable to be predicted (discrete or discretized) to each new observation. To verify the quality of this set of rules, a different set of observations, called the test set, is used. The variable to be predicted (classified) is called the class variable, and the remaining variables in the data set are called predictive attributes or features. Classification has important applications in fields such as medicine, bioinformatics, physics, pattern recognition and economics, where it is used for disease diagnosis, meteorological forecasting, insurance, and text classification, to name but a few.

Within a probabilistic approach, the supervised classification problem is treated as an inference problem. The probability distribution of the class variable given the predictive attributes is estimated from a training data set, and the quality of this estimation is then evaluated on an independent test data set. Many different approaches can be employed to estimate or learn this probability distribution.
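Once this conditional distribution has been estimated, the usual decision rule assigns to a new observation the class value with the highest estimated posterior probability. In standard notation (ours, not the paper's), with class variable C and predictive attributes X1, ..., Xm:

    \hat{c}(x_1, \ldots, x_m) = \arg\max_{c} \hat{P}(C = c \mid X_1 = x_1, \ldots, X_m = x_m)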

Combining classifiers into an ensemble can improve their accuracy in supervised classification, thanks to the high classification accuracy ensembles offer as well as their robustness to issues that appear in real applications, such as class imbalance or very small training sets. Among the different approaches to combining classification models, ensembles of decision trees are the most accepted and studied, although the approach is not restricted to decision trees and has also been applied to most other machine learning methods.

Decision trees (or classification trees) are a special family of classifiers with a simple structure that is very easy to interpret. The property that makes them particularly suitable for ensembles of classifiers, however, is their inherent instability: different training subsets drawn from a given problem domain produce very different trees. This characteristic was essential for their adoption in ensemble schemes such as Bagging (Breiman, 1996), Boosting (Freund & Schapire, 1996) and Random Forest (Breiman, 2001).

Classification noise refers to situations in which data sets have incorrect class labels in their training and/or test instances. There are many relevant situations in which this problem arises due to deficiencies in the capture of the training and/or test data (a wrong disease diagnosis method, human errors in the assignment of the class label, etc.). The performance of classifiers on data sets with classification noise is therefore a very important issue for machine learning methods.
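In experimental studies, this kind of noise is usually simulated by randomly relabelling a fixed percentage of the training instances (the concrete procedure used in this paper is described in the experimental section). The following sketch illustrates the idea; the function name and parameters are ours, not the paper's:

    import numpy as np

    def add_class_noise(y, noise_rate, seed=None):
        """Return a copy of the label vector y in which a fraction
        `noise_rate` of the instances receive a different class label,
        chosen uniformly at random among the remaining labels."""
        rng = np.random.default_rng(seed)
        y_noisy = np.asarray(y).copy()
        classes = np.unique(y_noisy)
        n_noisy = int(round(noise_rate * len(y_noisy)))
        # Select the instances whose class label will be corrupted.
        idx = rng.choice(len(y_noisy), size=n_noisy, replace=False)
        for i in idx:
            wrong = classes[classes != y_noisy[i]]
            y_noisy[i] = rng.choice(wrong)
        return y_noisy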

Many studies have been concerned with the performance of ensembles of decision trees in noisy data domains (Dietterich, 2000, Freund and Schapire, 1996, Melville et al., 2004). These studies showed that the performance of Boosting deteriorates strongly, while Bagging ensembles are the most robust and best-performing ensembles in these situations. Noisy training data usually increases the variance of the classifiers' predictions; therefore Bagging ensembles, which are based on variance reduction, work very well (Breiman, 1996).

In this study, we show that Bagging ensembles of a special type of decision tree, called credal decision trees, which are based on imprecise probabilities (via the Imprecise Dirichlet Model, Walley, 1996) and information-based uncertainty measures (via the maximum entropy function, Klir, 2006), can be a successful tool in classification problems with a high level of noise in the class variable.
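For reference, the Imprecise Dirichlet Model with parameter s > 0 replaces the single relative-frequency estimate of each class probability with an interval, and the uncertainty of the resulting credal set is summarized by its maximum entropy. In the standard formulation from the cited literature (with notation slightly adapted), for a class value c_j with n_{c_j} occurrences among N training instances,

    P(C = c_j) \in \left[ \frac{n_{c_j}}{N+s}, \ \frac{n_{c_j}+s}{N+s} \right],

and the uncertainty measure employed is

    H^{*}(\mathcal{K}(C)) = \max_{p \in \mathcal{K}(C)} \left( -\sum_{j} p(c_j) \log_2 p(c_j) \right),

where \mathcal{K}(C) is the credal set defined by those intervals.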

This paper is organized as follows: in Section 2, we present the necessary background on decision trees and on methods for building ensembles of decision trees; in Section 3, we focus on our method of Bagging credal decision trees; in Section 4, we present the results of experiments conducted to compare the performance of our method of Bagging credal decision trees with another Bagging scheme, which uses the popular C4.5 method, on data sets with different levels of classification noise; and finally, Section 6 is devoted to the conclusions.


Decision trees

Decision trees (also known as classification trees or hierarchical classifiers) have played an important role in machine learning since the publication of Quinlan's ID3 (Quinlan, 1986). Subsequently, Quinlan presented the algorithm C4.5 (Quinlan, 1993), an advanced version of ID3, which has since been considered a standard model in supervised classification. Decision trees have also been widely applied as a data analysis tool in very different fields, such as astronomy, biology,

Bagging

Breiman’s Bagging (bootstrap aggregating) (Breiman, 1996) is one of the first ensembles of decision trees. It is also the most intuitive and simple, and it performs very well. Diversity in Bagging is obtained by using bootstrapped replicas of the original training set: different training data sets are randomly drawn with replacement, and a single decision tree is then built from each replica using the standard approach (Breiman et al., 1984). Thus, each
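As a minimal illustration of the scheme just described (using scikit-learn's standard decision trees rather than the credal trees introduced later; function names and parameters are ours), a sketch of the training and voting steps could look as follows:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, n_trees=100, seed=None):
        """Train n_trees decision trees, each on a bootstrap replica
        (sampled with replacement) of the original training set.
        X and y are assumed to be NumPy arrays."""
        rng = np.random.default_rng(seed)
        n = len(X)
        trees = []
        for _ in range(n_trees):
            idx = rng.integers(0, n, size=n)          # bootstrap replica
            trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return trees

    def bagging_predict(trees, X):
        """Aggregate the individual predictions by majority vote
        (class labels are assumed to be integer-coded)."""
        votes = np.stack([t.predict(X) for t in trees])   # (n_trees, n_samples)
        return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])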

Bagging credal decision trees

A new method that uses a split criterion based on uncertainty measures and imprecise probabilities (the Imprecise Info-Gain criterion) to build simple decision trees was first presented in Abellán and Moral (2003) and, with a more elaborate procedure, in Abellán and Moral (2005). As with ID3, this decision tree is only defined for discrete variables, does not work with missing values, and does not carry out a posterior pruning process.
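As a sketch of how the Imprecise Info-Gain criterion can be computed: the score of an attribute is the maximum entropy of the class credal set minus the expected maximum entropy after splitting on that attribute, where the maximum-entropy distribution of an IDM credal set is obtained by distributing the extra mass s over the smallest class counts ("water-filling"). The code below illustrates this under those standard definitions; function names are ours, and the extensions to continuous features and missing values introduced in this paper are omitted:

    import numpy as np

    def idm_max_entropy(counts, s=1.0):
        """Maximum Shannon entropy (in bits) over the IDM credal set
        defined by the class counts and parameter s: the mass s is
        distributed over the smallest counts to make the distribution
        as uniform as the probability intervals allow."""
        counts = np.asarray(counts, dtype=float).copy()
        remaining = float(s)
        while remaining > 1e-12:
            m = counts.min()
            low = np.flatnonzero(counts == m)          # classes at the current minimum
            higher = counts[counts > m]
            gap = (higher.min() - m) if higher.size else np.inf
            step = min(gap, remaining / len(low))      # raise all minima together
            counts[low] += step
            remaining -= step * len(low)
        p = counts / counts.sum()
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    def imprecise_info_gain(y, x, s=1.0):
        """Imprecise Info-Gain of attribute values x with respect to
        class labels y (both assumed to be NumPy arrays)."""
        classes = np.unique(y)
        total = idm_max_entropy([np.sum(y == c) for c in classes], s)
        cond = 0.0
        for v in np.unique(x):
            mask = (x == v)
            cond += mask.mean() * idm_max_entropy([np.sum(y[mask] == c) for c in classes], s)
        return total - cond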

In a recent work (

Experimental analysis

The experimental analysis is carried out in two main blocks. In Section 4.1, we analyze how Bagging ensembles of credal decision trees are affected by the introduction of classification noise in terms of classification accuracy, and how the size of the single trees varies with the different noise rates. In Section 4.2, we employ a bias–variance error decomposition analysis to see why Bagging credal decision trees outperforms the standard approach in data sets with

Results of the accuracy

We present the accuracy results of each method, without and with a pruning procedure, on each data set and for each noise level, in Table 2 and Table 3, respectively.

Performance evaluation without tree post-pruning

In this subsection, we compare our approach, B-CDT, with B-C4.5 in terms of classification accuracy, both without post-pruning. We also give the average number of nodes of the trees in the different ensembles.

In Table 4, we report the average classification accuracy and the average size of the trees for both ensembles and for the different noise levels. In Fig. 3, these values are represented graphically with dashed lines for Bagging-C4.5R8 ensembles and with continuous

Performance evaluation with tree post-pruning

In this subsection, we analyze the performance of both ensembles when the decision trees apply a post-pruning method (see Section 2, Decision trees, and Section 3.3, Decision tree inducer).

In Table 6, we show the average classification accuracy and the average size of the trees for both ensembles. In Fig. 4 we also graphically show these values.

In Dietterich (2000), no conclusions were reached about the suitability of introducing post-pruning methods for decision trees in Bagging ensembles. Our findings

Experimental analysis

The average percentage values of the decomposition of the classification error into bias and variance components for both ensembles, without and with post-pruning methods, are detailed in Table 8 and Table 9, respectively. They are also depicted graphically in Fig. 5 and Fig. 6.
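For reference, a decomposition commonly used in this kind of analysis (taken here only as an illustration of the quantities involved; the exact formulation used is given in the experimental setup) is that of Kohavi and Wolpert (1996). For an instance x, with Y_F the true class distribution and Y_H the distribution of the classifier's prediction over training sets,

    \mathrm{bias}^2_x = \frac{1}{2} \sum_{y} \bigl( P(Y_F = y \mid x) - P(Y_H = y \mid x) \bigr)^2, \qquad
    \mathrm{variance}_x = \frac{1}{2} \Bigl( 1 - \sum_{y} P(Y_H = y \mid x)^2 \Bigr),

and the expected 0–1 error at x is the sum of these two terms plus an irreducible noise term.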

The following are the main findings of this first analysis:

  • As expected, the introduction of post-pruning methods increases the bias component while reducing the variance, especially in data sets with high noise

Conclusions

In this paper, we have extended a method to build decision trees based on imprecise probabilities and uncertainty measures (called credal decision trees) to one which works with continuous features and missing data. As an application of this type of decision tree, we have used them in a Bagging scheme on data sets with class noise. In our experimental study, we have shown that our method of Bagging credal decision trees can reduce the percentage of error in classification when it is

Acknowledgments

This work has been supported by the Spanish “Consejería de Economía, Innovación y Ciencia de la Junta de Andalucía”, “Ministerio de Educación y Ciencia” and by the Spanish research programme Consolider Ingenio 2010 under Projects TIC-06016, TIN2010-20900-C04-01 and MIPRCV (CSD2007-00018), respectively.

A shortened version of this paper was presented at FOIKS’10.

References (27)

  • J. Abellán et al., Upper entropy of credal sets. Applications to credal classification, International Journal of Approximate Reasoning (2005)
  • J. Abellán et al., An ensemble method using credal decision trees, European Journal of Operational Research (2010)
  • J.M. Bernard, An introduction to the imprecise Dirichlet model for multinomial data, International Journal of Approximate Reasoning (2005)
  • J. Quinlan, Simplifying decision trees, International Journal of Man-Machine Studies (1987)
  • J. Abellán, Combining nonspecificity measures in Dempster–Shafer theory of evidence, International Journal of General Systems (2011)
  • J. Abellán, Uncertainty measures on probability intervals from Imprecise Dirichlet model, International Journal of General Systems (2006)
  • J. Abellán et al., Maximum entropy for credal sets, International Journal of Uncertainty Fuzziness and Knowledge-Based Systems (2003)
  • J. Abellán et al., Building classification trees using the total uncertainty criterion, International Journal of Intelligent Systems (2003)
  • J. Abellán et al., An algorithm that computes the upper entropy for order-2 capacities, International Journal of Uncertainty Fuzziness and Knowledge-Based Systems (2006)
  • J. Abellán et al., Disaggregated total uncertainty measure for credal sets, International Journal of General Systems (2006)
  • J. Abellán et al., An experimental study about simple decision trees for Bagging ensemble on datasets with classification noise
  • J. Abellán et al., Requirements for total uncertainty measures in Dempster–Shafer theory of evidence, International Journal of General Systems (2008)
  • L. Breiman et al., Classification and regression trees (1984)