
Knowledge-Based Systems

Volume 38, January 2013, Pages 85-104

A hierarchical genetic fuzzy system based on genetic programming for addressing classification with highly imbalanced and borderline data-sets

https://doi.org/10.1016/j.knosys.2012.08.025

Abstract

Many real-world applications involve classification with imbalanced data-sets. This problem arises when the number of instances of one class is very different from the number of instances of the other class. Traditionally, classification algorithms are unable to deal correctly with this issue, as they are biased towards the majority class. Therefore, algorithms tend to misclassify the minority class, which is usually the most interesting one for the application at hand.

Among the available learning approaches, fuzzy rule-based classification systems have shown good behavior in the scenario of imbalanced data-sets. In this work, we focus on some modifications to further improve the performance of these systems through the usage of information granulation. Specifically, a positive synergy between data sampling methods and algorithmic modifications is proposed, creating a genetic programming approach that uses linguistic variables in a hierarchical way. These linguistic variables are adapted to the context of the problem with a genetic process that combines rule selection with the adjustment of the lateral position of the labels, based on the 2-tuples linguistic model.

An experimental study is carried out over highly imbalanced and borderline imbalanced data-sets, completed by a statistical comparative analysis. The results obtained show that the proposed model outperforms several fuzzy rule-based classification systems, including a hierarchical approach, and behaves better than the C4.5 decision tree.

Introduction

Learning from imbalanced data-sets is an issue that has attracted a lot of attention in machine learning research [29], [51]. This problem is characterized by a class distribution where the examples of one class are outnumbered by the examples of the other class. Imbalanced data-sets are prevalent in a large number of real problems including, but not limited to, medical diagnosis, fraud detection, finance, risk management and network intrusion detection. Additionally, the positive or minority class is usually the one of highest interest from the learning point of view, and misclassifying it usually implies a great cost [17], [57].

A standard classifier that seeks accuracy over the full range of instances is frequently not suitable for imbalanced learning tasks, since it tends to be overwhelmed by the majority class and thus misclassifies the minority examples. This situation becomes critical when the minority class is greatly outnumbered by the majority class, generating a scenario of highly imbalanced data-sets where the performance deterioration is amplified. However, some studies have shown that imbalance by itself is not the only factor that hinders classification performance [37]. There are several intrinsic data characteristics that lower the learning effectiveness. Some of these handicaps are the presence of small disjuncts [53], the overlap between the classes [26] and the existence of noisy [49] and borderline [44] samples. Needless to say, when the classification data exhibit a skewed distribution together with any of the aforementioned situations, the performance degradation is intensified [19], [42], [53].

A large number of approaches have been proposed to deal with the class imbalance problem. These solutions fall largely into two major categories. The first is data sampling, in which the training data distribution is modified to obtain a balanced set, which helps standard classifiers to correctly identify the data [9], [6]. The second is algorithmic modification, where the base learning methods are adapted to consider the imbalanced distribution of the data; in this manner, base learning methods change some of their internal operations accordingly [57].
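
As a concrete illustration of the data-sampling category, the sketch below shows the interpolation idea behind SMOTE [9], the over-sampling technique used later in this work: each synthetic minority example is placed on the segment joining a minority instance and one of its k nearest minority neighbours. This is a minimal NumPy sketch of that idea only, not the exact procedure of the original algorithm; the function name and parameters are illustrative.

    import numpy as np

    def smote_like_oversample(X_min, n_synthetic, k=5, seed=None):
        """Create synthetic minority examples by interpolating between a minority
        instance and one of its k nearest minority neighbours (a simplified sketch
        of the SMOTE idea, not the original algorithm)."""
        rng = np.random.default_rng(seed)
        X_min = np.asarray(X_min, dtype=float)
        synthetic = []
        for _ in range(n_synthetic):
            i = rng.integers(len(X_min))
            x = X_min[i]
            distances = np.linalg.norm(X_min - x, axis=1)
            neighbours = np.argsort(distances)[1:k + 1]   # skip the instance itself
            x_nn = X_min[rng.choice(neighbours)]
            gap = rng.random()                            # random point on the segment
            synthetic.append(x + gap * (x_nn - x))
        return np.vstack(synthetic)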

Fuzzy Rule-Based Classification Systems (FRBCSs) [34] are useful and well-known tools in the machine learning framework. They provide a good trade-off between the empirical precision of traditional engineering techniques and the interpretability achieved through the use of linguistic labels whose semantics are close to natural language. Specifically, recent works have shown that FRBCSs behave well on imbalanced data-sets when combined with instance preprocessing techniques [20].

The hybridization of fuzzy logic and genetic algorithms, leading to Genetic Fuzzy Systems (GFSs) [12], [30], is one of the most popular approaches to combining different computational intelligence techniques. A GFS is basically a fuzzy system augmented by a learning process based on evolutionary computation. Among evolutionary algorithms, Genetic Programming (GP) [39] is a development of classical genetic algorithms that evolves tree-shaped solutions using variable-length chromosomes. GP has been used in FRBCSs to learn fuzzy rule bases [7], profiting from its high expressive power and flexibility.
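
To make the tree-shaped encoding concrete, the following minimal sketch (our own illustration, not the exact grammar used by GP-COACH) represents a DNF-style fuzzy rule antecedent as a tree whose leaves are (variable, linguistic label) pairs and whose internal nodes combine membership degrees with the minimum (AND) or maximum (OR):

    from dataclasses import dataclass
    from typing import Callable, Dict, List, Tuple

    # Membership functions indexed by (variable, label); each maps a crisp value
    # to a membership degree in [0, 1].
    MF = Dict[Tuple[str, str], Callable[[float], float]]

    @dataclass
    class Leaf:
        variable: str
        label: str
        def firing(self, example: Dict[str, float], mf: MF) -> float:
            return mf[(self.variable, self.label)](example[self.variable])

    @dataclass
    class Node:
        op: str               # "AND" (minimum) or "OR" (maximum)
        children: List
        def firing(self, example: Dict[str, float], mf: MF) -> float:
            degrees = [child.firing(example, mf) for child in self.children]
            return min(degrees) if self.op == "AND" else max(degrees)

    # IF (x1 is Low OR x1 is Medium) AND (x2 is High) THEN Class = positive
    antecedent = Node("AND", [Node("OR", [Leaf("x1", "Low"), Leaf("x1", "Medium")]),
                              Leaf("x2", "High")])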

However, a disadvantage of FRBCSs is the inflexibility of the concept of a linguistic variable, which imposes hard restrictions on the fuzzy rule structure [5] and may entail a loss of accuracy when dealing with some complex systems, such as high-dimensional problems or data with noise or overlapped classes. Many different possibilities for enhancing linguistic fuzzy modeling have been considered in the specialized literature. All of these approaches share the common idea of improving the way in which the linguistic fuzzy model performs the interpolative reasoning by inducing a better cooperation among the rules in the Knowledge Base (KB). This rule cooperation may be induced by acting on three different model components:

  • Approaches acting on the whole KB. This includes KB derivation [43] and hierarchical linguistic rule learning [14].

  • Approaches acting on the Rule Base (RB). The most common approach is rule selection [35], but multiple rule consequent learning [11] can also be considered.

  • Approaches acting on the Data Base (DB). Examples are a priori granularity learning [13] and membership function tuning [1].

In this work, we present a procedure to obtain a Hierarchical Fuzzy Rule Based Classification System (HFRBCS) to deal with imbalanced data-sets. In order to do so, this model introduces modifications both at the data level and at the algorithm level. The procedure is divided into three steps:

  1. A preprocessing technique, the Synthetic Minority Over-sampling Technique (SMOTE) [9], is used to balance the distribution of training examples between the two classes.

  2. A hierarchical knowledge base (HKB) [14] is generated, using the GP-COACH (Genetic Programming-based learning of COmpact and ACcurate fuzzy rule-based classification systems for High-dimensional problems) algorithm [7] to build the RB. The GP-COACH algorithm has been modified to extend a classical KB into an HKB, integrating a rule expansion process that creates high-granularity rules in each generation of the algorithm. The usage of an HKB implies an adaptation of the components to allow the interaction of the different granularities in the RB population.

  3. A post-processing step involving genetic rule selection and 2-tuples based lateral tuning is applied to improve the overall performance (a minimal sketch of the 2-tuple displacement follows this list).
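
To illustrate the 2-tuple linguistic representation used in this last step, the sketch below (our own minimal illustration, assuming triangular labels on a uniform partition) shifts a label s_i by a lateral displacement alpha in [-0.5, 0.5), expressed as a fraction of the distance between adjacent label centres:

    def triangular(x, a, b, c):
        """Triangular membership function with vertices a <= b <= c."""
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

    def tuned_membership(x, i, n_labels, alpha, lower=0.0, upper=1.0):
        """Membership of x in the 2-tuple (s_i, alpha): the i-th triangular label of
        a uniform partition of [lower, upper], displaced laterally by alpha times the
        distance between adjacent label centres (a sketch of lateral tuning)."""
        step = (upper - lower) / (n_labels - 1)   # distance between label centres
        centre = lower + (i + alpha) * step       # displaced centre of label s_i
        return triangular(x, centre - step, centre, centre + step)

    # Label s_2 of a 5-label partition of [0, 1], shifted slightly to the right.
    print(tuned_membership(0.55, i=2, n_labels=5, alpha=0.1))   # ≈ 0.9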

The combination of these steps constitutes a convenient approach to solve the problem of classification with imbalanced data-sets. First of all, the preprocessing technique balances the number of instances of each class, easing the learning process for the subsequent procedures. Then, the step that learns the HKB addresses the imbalance problem together with some of the intrinsic data characteristics that hinder learning. This HKB process is appropriate because it increases the accuracy by reinforcing those problem subspaces that are especially difficult in this environment, such as borderline instances [44], small disjuncts [37] or overlapping regions [26]. Finally, the post-processing step refines the results achieved by the previous process. The integration of these schemes completes our proposal, which will be denoted as GP-COACH-H (GP-COACH Hierarchical).
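
The key property behind the hierarchical knowledge base is that a layer with 2·n − 1 labels preserves the label centres of an n-label layer while adding an intermediate label between each pair, so rules of both granularities can coexist over the same variables [14]. A minimal sketch of this idea, assuming uniform triangular partitions (the granularities 5 and 9 are only illustrative):

    import numpy as np

    def partition_centres(n_labels, lower=0.0, upper=1.0):
        """Centres of a uniform triangular fuzzy partition with n_labels labels."""
        return np.linspace(lower, upper, n_labels)

    low = partition_centres(5)           # [0.0, 0.25, 0.5, 0.75, 1.0]
    high = partition_centres(2 * 5 - 1)  # [0.0, 0.125, 0.25, ..., 1.0]

    # Every centre of the low-granularity layer is kept in the high-granularity one.
    assert set(np.round(low, 6)).issubset(set(np.round(high, 6)))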

We will focus on two difficult situations in the scenario of imbalanced data: highly imbalanced and borderline imbalanced classification problems. For this purpose, we have selected benchmarks of 44 and 30 problems, respectively, from the KEEL data-set repository [2]. We will perform our experimental analysis focusing on the precision of the models using the Geometric Mean of the true rates (GM) [4]. This study will be carried out using non-parametric tests to check whether there are significant differences among the obtained results [25].
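
The Geometric Mean of the true rates [4] balances accuracy over the two classes: GM = sqrt(TPrate · TNrate), where TPrate (sensitivity) and TNrate (specificity) are the per-class accuracies. A minimal sketch of its computation from confusion-matrix counts:

    import math

    def geometric_mean(tp, fn, tn, fp):
        """GM = sqrt(TPrate * TNrate) computed from confusion-matrix counts."""
        tp_rate = tp / (tp + fn)   # sensitivity (positive / minority class)
        tn_rate = tn / (tn + fp)   # specificity (negative / majority class)
        return math.sqrt(tp_rate * tn_rate)

    # A classifier that labels everything as negative gets GM = 0 despite ~99% accuracy.
    print(geometric_mean(tp=0, fn=10, tn=990, fp=0))   # 0.0
    print(geometric_mean(tp=8, fn=2, tn=900, fp=90))   # ≈ 0.85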

This work is structured in the following way. First, Section 2 presents an introduction to classification with imbalanced data-sets, describing its features, the SMOTE algorithm and the metrics used in this framework. Next, Section 3 introduces the proposed approach. Sections 4 and 5 describe the experimental framework used and the analysis of results, respectively. Then, the conclusions achieved in this work are presented in Section 6. Finally, we include an appendix with the detailed results for the experiments performed in the experimental study.


Imbalanced data-sets in classification

In this section we delimit the context in which this work is contained, briefly introducing the problem of imbalanced classification. Then, we describe the preprocessing technique that we have applied in order to deal with imbalanced data-sets: the SMOTE algorithm [9]. We finish this section by describing the evaluation metrics used in this specific problem, contrasting them with the most common ones in classification.

The hierarchical genetic programming fuzzy rule based classification system with rule selection and tuning (GP-COACH-H)

In this section, we describe our proposal for obtaining a hierarchical FRBCS through the usage of GP, applying rule selection together with 2-tuples lateral tuning; it is denoted GP-COACH-H. This proposal is defined through its components in the following way: Section 3.1 presents a brief introduction to FRBCSs in order to contextualize the algorithm; next, Section 3.2 describes the GP-COACH algorithm [7], the GP-based linguistic rule generation method that we have used as the base for our proposal.

Experimental framework

In this section, we present the set-up of the experimental framework used to develop the analysis of our proposal. First we introduce the algorithms selected for comparison with the proposed approach and their configuration parameters (Section 4.1). Next, we provide details of the problems chosen for the experimentation (Section 4.2). Finally, we present the statistical tests applied to compare the results obtained with the different classifiers (Section 4.3).

Experimental study

In this section, we present a set of experiments to illustrate the behavior of GP-COACH-H. These experiments are designed with two objectives: to exemplify how the GP-COACH-H algorithm works, and to determine its robustness on highly imbalanced and borderline imbalanced data-sets.

We organize these experiments in the following way. First, Section 5.1 presents a case study over one of the highly imbalanced data-sets presented in the previous section.

Concluding remarks

In this paper we have presented a FRBCS with different granulation levels that integrates rule selection and the 2-tuples tuning approach to improve performance on imbalanced data-sets. The proposal integrates data sampling together with algorithmic modifications to the basic approach, adapts its behavior to the different granulation levels considered, and adds a post-processing step that helps the hierarchical fuzzy rule-based classification system to better adapt to the context of the problem.

Acknowledgments

This work has been supported by the Spanish Ministry of Science and Technology under Projects TIN2011-28488 and TIN2008-06681-C06-02, and by the Andalusian Research Plan under Projects P10-TIC-6858 and TIC-3928. V. López holds an FPU scholarship from the Spanish Ministry of Education.

References (57)

  • J. Liu et al., A comparative study on rough set based class imbalance learning, Knowledge-Based Systems (2008)
  • V. López et al., Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics, Expert Systems with Applications (2012)
  • L. Magdalena et al., A fuzzy logic controller with learning through the evolution of its knowledge base, International Journal of Approximate Reasoning (1997)
  • J. Ren, ANN vs. SVM: which one performs better in classification of MCCs in mammogram imaging, Knowledge-Based Systems (2012)
  • R. Alcalá et al., A proposal for the genetic lateral tuning of linguistic fuzzy systems and its interaction with rule selection, IEEE Transactions on Fuzzy Systems (2007)
  • J. Alcalá-Fdez et al., KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, Journal of Multi-Valued Logic and Soft Computing (2011)
  • J. Alcalá-Fdez et al., KEEL: a software tool to assess evolutionary algorithms for data mining problems, Soft Computing (2009)
  • A. Bastian, How to handle the flexibility of linguistic variables with applications, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems (1994)
  • G.E.A.P.A. Batista et al., A study of the behaviour of several methods for balancing machine learning training data, SIGKDD Explorations (2004)
  • J. Charles, Automatic recognition of complete palynomorphs in digital images, Machine Vision and Applications (2011)
  • N.V. Chawla et al., SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research (2002)
  • Z. Chi et al., Fuzzy Algorithms with Applications to Image Processing and Pattern Recognition (1996)
  • O. Cordón et al., A proposal for improving the accuracy of linguistic modeling, IEEE Transactions on Fuzzy Systems (2000)
  • O. Cordón et al., Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases (2001)
  • O. Cordón et al., Generating the knowledge base of a fuzzy rule-based system by the genetic learning of the data base, IEEE Transactions on Fuzzy Systems (2001)
  • O. Cordón et al., Linguistic modeling by hierarchical systems of linguistic rules, IEEE Transactions on Fuzzy Systems (2002)
  • C. Drummond, R.C. Holte, C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling, in: ...
  • C. Elkan, The foundations of cost-sensitive learning, in: Proceedings of the 17th IEEE International Joint Conference ...