A hierarchical genetic fuzzy system based on genetic programming for addressing classification with highly imbalanced and borderline data-sets
Introduction
Learning from imbalanced data-sets is an issue that has attracted a lot of attention in machine learning research [29], [51]. This problem is characterized by a class distribution where the number of examples in one class is outnumbered by the number of examples in the other class. The presence of imbalanced data-sets is dominant in a high number of real problems including, but not limited to, medical diagnosis, fraud detection, finances, risk management, network intrusion and so on. Additionally, the positive or minority class is usually the one that has the highest interest from the learning point of view and it also implies a great cost when it is not well classified [17], [57].
A standard classifier that seeks accuracy over a full range of instances is frequently not suitable to deal with imbalanced learning tasks, since it tends to be overwhelmed by the majority class thus misclassifying the minority examples. This situation becomes critical when the minority class is greatly outnumbered by the majority class, generating an scenario of highly imbalanced data-sets where the performance deterioration is amplified. However, some studies have shown that imbalance for itself is not the only factor that hinders the classification performance [37]. There are several data intrinsic characteristics which lower the learning effectiveness. Some of these handicaps within the data are the presence of small disjuncts [53], the overlap between the classes [26] or the existence of noisy [49] and borderline [44] samples. There is no need to say that when the classification data share an skewed data distribution together with any of the aforementioned situations, the performance degradation is intensified [19], [42], [53].
A large number of approaches have been proposed to deal with the class imbalance problem. Those solutions fall largely into two major categories. The first is data sampling in which the training data distribution is modified to obtain a set with a balanced distribution. Standard classifiers are thus helped to obtain a correct identification of data [9], [6]. The second is through algorithmic modification where the base learning methods are modified to consider the imbalanced distribution of the data. In this manner, base learning methods change some of its internal operations accordingly [57].
Fuzzy Rule-Based Classification Systems (FRBCSs) [34] are useful and well-known tools in the machine learning framework. They provide a good trade-off between the empirical precision of traditional engineering techniques and the interpretability achieved through the use of linguistic labels whose semantic is close to the natural language. Specifically, recent works have shown that FRBCSs have a good behavior dealing with imbalanced data-sets by means of the application of instance preprocessing techniques [20].
The hybridization between fuzzy logic and genetic algorithms leading to Genetic Fuzzy Systems (GFSs) [12], [30] is one of the most popular approaches used when different computational intelligence techniques are combined. A GFS is basically a fuzzy system augmented by a learning process based on evolutionary computation. Among evolutionary algorithms, Genetic Programming (GP) [39] is a development of classical genetic algorithms that evolve tree-shaped solutions using variable length chromosomes. GP has been used in FRBCSs to learn fuzzy rule bases [7] profitting from its high expressive power and flexibility.
However, the disadvantage of FRBCSs is the inflexibility of the concept of linguistic variable because it imposes hard restrictions on the fuzzy rule structure [5] which may suppose a loss in accuracy when dealing with some complex systems, such as high dimensional problems, the presence of noise or overlapped classes. Many different possibilities to enhance the linguistic fuzzy modeling have been considered in the specialized literature. All of these approaches share the common idea of improving the way in which the linguistic fuzzy model performs the interpolative reasoning by inducing a better cooperation among the rules in the Knowledge Base (KB). This rule cooperation may be induced acting on three different model components:
- •
Approaches acting on the whole KB. This includes the KB derivation [43] and a hierarchical linguistic rule learning [14].
- •
Approaches acting on the Rule Base (RB). The most common approach is rule selection [35] but also multiple rule consequent learning [11] could be considered.
- •
Approaches acting on the Data Base (DB). For example a priori granularity learning [13] or membership function tuning [1].
In this work, we present a procedure to obtain an Hierarchical Fuzzy Rule Based Classification System (HFRBCS) to deal with imbalanced data-sets. In order to do so, this model introduces modifications both at the data and algorithm level. This procedure is divided into three different steps:
- 1.
A preprocessing technique, the Synthetic Minority Over-sampling Technique (SMOTE) [9], is used to balance the distribution of training examples in both classes.
- 2.
A hierarchical knowledge base (HKB) [14] is generated, using the GP-COACH (Genetic Programming-based learning of COmpact and ACcurate fuzzy rule-based classification systems for High-dimensional problems) algorithm [7] to build the RB. The GP-COACH algorithm has been modified to extend a classical KB into a HKB, integrating a rule expansion process to create high granularity rules in each generation of the algorithm. The usage of a HKB implies an adaptation of the components to allow the interaction of the different granularities in the RB population.
- 3.
A post-processing step involving rule selection and the application of the 2-tuples based genetic tuning is applied to improve the overall performance.
The combination of these steps constitutes a convenient approach to solve the problem of classification with imbalanced data-sets. First of all, the preprocessing technique compensates the number of instances for each class easing the learning process for the consequent procedures. Then, the step to learn the HKB is used to address the imbalanced problem together with some of the data intrinsic characteristics that difficult the learning. This HKB process is appropriate because it increases the accuracy by reinforcing those problem subspaces that are specially difficult in this environment, such as borderline instances [44], small disjuncts [37] or overlapping regions [26]. Finally, the post-processing step refines the results achieved by the previous process. The integration of these schemes completes our proposal, which will be denoted as GP-COACH-H (GP-COACH Hierarchical).
We will focus on two difficult situations in the scenario of imbalanced data, such as highly imbalanced and borderline imbalanced classification problems. For that, we have selected a benchmark of 44 and 30 problems respectively from KEEL data-set repository1 [2]. We will perform our experimental analysis focusing on the precision of the models using the Geometric Mean of the true rates (GM) [4]. This study will be carried out using non-parametric tests to check whether there are significant differences among the obtained results [25].
This work is structured in the following way. First, Section 2 presents an introduction of classification with imbalanced problems, describing its features, the SMOTE algorithm and the metrics that are used in this framework. Next, Section 3 introduces the proposed approach. Sections 4 Experimental framework, 5 Experimental study describe the experimental framework used and the analysis of results, respectively. Next, the conclusions achieved in this work are shown in Section 6. Finally, we include an appendix with the detailed results for the experiments performed in the experimental study.
Section snippets
Imbalanced data-sets in classification
In this section we delimit the context in which this work is content, briefly introducing the problem of imbalanced classification. Then, we will describe the preprocessing technique that we have applied in order to deal with the imbalanced data-sets: the SMOTE algorithm [9]. We finish this section describing the evaluation metrics that are used in this specific problem with respect to the most common ones in classification.
The hierarchical genetic programming fuzzy rule based classification system with rule selection and tuning (GP-COACH-H)
In this section, we will describe our proposal to obtain a hierarchical FRBCS through the usage of GP and applying rule selection together with 2-tuples lateral tuning, denoted as GP-COACH-H. This proposal is defined through its components in the following way: Section 3.1 presents a brief introduction of FRBCSs in order to contextualize the algorithm; next, Section 3.2 describes the GP-COACH algorithm [7] which is the linguistic rule generation method based on GP that we have used as base for
Experimental framework
In this section, we present the set up of the experimental framework used to develop the analysis of our proposal. First we introduce the algorithms selected for the comparison with the proposed approach and their configuration parameters (subSection 4.1). Next, we provide details of the problems chosen for the experimentation (subSection 4.2). Finally, we present the statistical tests applied to compare the results obtained with the different classifiers (subSection 4.3).
Experimental study
In this section, we present a set of experiments to illustrate and demonstrate the behavior of GP-COACH-H. These experiments are designed towards two objectives: to exemplify how the GP-COACH-H algorithm works, and to determine its robustness for highly and borderline imbalanced data-sets.
We organize those experiments in the following way. First, Section 5.1 presents a case of study over one one of the highly imbalanced data-sets presented in the previous section. Next, Section 5.2 contains an
Concluding remarks
In this paper we have presented a FRBCS with different granulation levels that integrates rule selection and the 2-tuples tuning approach to improve the performance in imbalanced data-sets. The proposal integrates data sampling together with algorithm modifications to the basic approach and adapts its behavior to the different granulation levels considered, adding a post-processing step that helps the hierarchical fuzzy rule base classification system to have a better adaptation to the context
Acknowledgments
This work has been supported by the Spanish Ministry of Science and Technology under Projects TIN2011-28488 and TIN2008-06681-C06-02, and the Andalusian Research Plan P10-TIC-6858 and TIC-3928. V. López holds a FPU scholarship from Spanish Ministry of Education.
References (57)
- et al.
Strategies for learning in class imbalance problems
Pattern Recognition
(2003) - et al.
GP-COACH: genetic programming-based learning of compact and accurate fuzzy rule-based classification systems for high-dimensional problems
Information Sciences
(2010) - et al.
A proposal on reasoning methods in fuzzy rule-based classification systems
International Journal of Approximate
(1999) - et al.
A study of the behaviour of linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets
Fuzzy Sets and Systems
(2008) - et al.
Hierarchical fuzzy rule based classification systems with genetic rule selection for imbalanced data-sets
International Journal of Approximate Reasoning
(2009) - et al.
On the 2-tuples based genetic tuning performance for fuzzy rule based classification systems in imbalanced data-sets
Information Sciences
(2010) - et al.
Evolutionary-based selection of generalized instances for imbalanced classification
Knowledge-Based Systems
(2012) - et al.
On the effectiveness of preprocessing methods when dealing with different levels of class imbalance
Knowledge-Based Systems
(2012) - et al.
Class imbalance methods for translation initiation site recognition in DNA sequences
Knowledge-Based Systems
(2012) - et al.
Iterative boolean combination of classifiers in the ROC space: an application to anomaly detection with HMMs
Pattern Recognition
(2010)