Combining integrated sampling with SVM ensembles for learning from imbalanced datasets

https://doi.org/10.1016/j.ipm.2010.11.007

Abstract

Learning from imbalanced datasets is difficult. The insufficient information associated with the minority class impedes a clear understanding of the inherent structure of the dataset. Most existing classification methods tend not to perform well on minority class examples when the dataset is extremely imbalanced, because they aim to optimize the overall accuracy without considering the relative distribution of each class. In this paper, we study the performance of SVMs, which have achieved great success in many real applications, in the imbalanced data context. Through empirical analysis, we show that SVMs may suffer from biased decision boundaries, and that their prediction performance drops dramatically when the data is highly skewed. We propose to combine an integrated sampling technique, which incorporates both over-sampling and under-sampling, with an ensemble of SVMs to improve the prediction performance. Extensive experiments show that our method outperforms individual SVMs as well as several other state-of-the-art classifiers.

Introduction

A standard two-class classification method usually makes the simple assumption that the classes to be discriminated have a comparable number of instances. Accordingly, most current classification systems are designed to optimize the overall performance rather than considering the relative distribution of each class. However, this can be problematic in practice. Many real-world datasets are highly skewed: most of the cases belong to a larger class, and far fewer cases belong to a smaller, yet usually more interesting class. Examples of applications with such datasets include searching for oil spills in radar images (Kubat, Holte, & Matwin, 1998), telephone fraud detection (Fawcett & Provost, 1997), credit card fraud detection (Chan and Stolfo, 1998, Chen et al., 2009), diagnosis of weld flaws (Warren Liao, 2008), and text categorization (Dumais et al., 1998, Ertekin et al., 2007, Stamatatos, 2008). In such settings, classification systems tend to misclassify minority class examples as the majority class, leading to a high false negative rate, and the cost of misclassifying the small (positive) class is usually high. Take network intrusion detection as an example: the number of malicious intrusions to a website is small compared with the millions of regular accesses every day, yet the loss may be huge if an intrusion leads to the leakage of private internal data or the disruption of website functionality.

To solve this problem, two categories of techniques have been proposed: sampling approaches and algorithm-based approaches. Generally, sampling approaches include methods that over-sample the minority class to match the size of the majority class (Guo and Viktor, 2004, Ling and Li, 1998, Solberg and Solberg, 1996), and methods that under-sample the majority class to match the size of the minority class (Chen et al., 2004, Kubat et al., 1998, Kubat and Matwin, 1997, Wilson and Martinez, 2000). Algorithm-based approaches aim at improving a classifier's performance based on its inherent characteristics, for example, methods particularly tailored to Decision Trees, Neural Networks (MLPs), Naive Bayes systems, etc.

This paper is concerned with improving the performance of Support Vector Machines (SVMs) on imbalanced datasets. SVMs have gained success in many applications, such as text mining and handwriting recognition. However, when the data is highly imbalanced, the decision boundary obtained from the training data is biased toward the minority class. Most approaches proposed to address this problem have been algorithm-based (Akbani et al., 2004, Veropoulos et al., 1999, Wu and Chang, 2004); they attempt to adjust the decision boundary by modifying the decision function, e.g., adjusting the kernel function or changing the intercept. We take a complementary approach and study the use of sampling as well as ensemble techniques to improve the SVM's performance.

First, our observations indicate that using over-sampling alone as proposed in previous work (e.g., SMOTE; Akbani et al., 2004) can introduce excessive noise and lead to ambiguity along decision boundaries. We propose to integrate the two types of sampling strategies by first over-sampling the minority class to a moderate extent, and then under-sampling the majority class to a similar size. This provides the learner with more robust training data. We show by empirical results that the proposed sampling approach outperforms over-sampling alone irrespective of the parameter selection. We further consider using an ensemble of SVMs, which we call EnSVM, to boost the performance. A collection of SVMs is trained individually on the processed data, and the final prediction is obtained by combining the results from those individual SVMs. We then show that the generalization capability of EnSVM can be further improved by retaining only a subset of the component SVM classifiers, and propose a new approach called EnSVM+, which utilizes genetic algorithms to perform classifier selection.
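To make the integrated sampling step concrete, the following is a minimal sketch, not the paper's exact procedure: the function name integrated_sample, the ratio parameter, and the use of random duplication in place of SMOTE-style synthetic examples are illustrative assumptions.

    import numpy as np

    def integrated_sample(X, y, minority=1, ratio=0.5, rng=None):
        """Sketch: over-sample the minority class to a moderate target,
        then under-sample the majority class to a similar size. `ratio`
        is an assumed knob: the target minority size as a fraction of
        the original majority size."""
        rng = rng or np.random.default_rng(0)
        min_idx = np.flatnonzero(y == minority)
        maj_idx = np.flatnonzero(y != minority)

        # Step 1: moderate over-sampling of the minority class (random
        # duplication here; SMOTE-style synthetic examples would also fit).
        target = int(ratio * len(maj_idx))
        extra = rng.choice(min_idx, size=max(0, target - len(min_idx)), replace=True)
        min_idx = np.concatenate([min_idx, extra])

        # Step 2: under-sample the majority class to match the new minority size.
        maj_idx = rng.choice(maj_idx, size=len(min_idx), replace=False)

        keep = np.concatenate([min_idx, maj_idx])
        rng.shuffle(keep)
        return X[keep], y[keep]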

In summary, we make the following contributions:

  • 1. We carefully design a series of experiments to demonstrate that over-sampling alone can mislead the decision boundary of an SVM when the data are highly skewed.

  • 2. We propose a novel strategy that combines the two types of sampling methods, aiming to achieve better performance by balancing the class distribution.

  • 3. We propose the ensemble of SVMs (EnSVM) model to integrate the classification results of weak classifiers constructed individually on the processed data, and develop a genetic algorithm-based model called EnSVM+ to further boost classification performance through classifier selection (a sketch of one such selection procedure follows this list). The effectiveness of the proposed models is confirmed by experiments.
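As a rough illustration of genetic-algorithm-based classifier selection, the sketch below evolves a bit mask over the component SVMs. The encoding, the tournament/crossover/mutation operators, the validation-accuracy fitness, and the hypothetical preds array of component predictions are illustrative assumptions, not the paper's exact design.

    import numpy as np

    def ga_select(preds, y_val, pop=20, gens=30, p_mut=0.05, rng=None):
        """Sketch: `preds` is an (n_classifiers, n_val) array of {+1, -1}
        predictions on a validation set; a bit mask marks which component
        SVMs stay in the ensemble."""
        rng = rng or np.random.default_rng(0)
        n = preds.shape[0]

        def fitness(mask):
            # Validation accuracy of the majority vote of the selected SVMs.
            if not mask.any():
                return 0.0
            vote = np.sign(preds[mask].sum(axis=0))   # ties count as errors
            return float((vote == y_val).mean())

        population = rng.integers(0, 2, size=(pop, n)).astype(bool)
        for _ in range(gens):
            scores = np.array([fitness(m) for m in population])
            new = [population[scores.argmax()].copy()]          # elitism
            while len(new) < pop:
                i, j = rng.integers(0, pop, 2)                  # tournament
                p1 = population[i] if scores[i] >= scores[j] else population[j]
                i, j = rng.integers(0, pop, 2)
                p2 = population[i] if scores[i] >= scores[j] else population[j]
                cut = int(rng.integers(1, n))                   # one-point crossover
                child = np.concatenate([p1[:cut], p2[cut:]])
                child ^= rng.random(n) < p_mut                  # bit-flip mutation
                new.append(child)
            population = np.array(new)
        scores = np.array([fitness(m) for m in population])
        return population[scores.argmax()]                      # best subset mask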

The rest of the paper is organized as follows. We discuss the related work in Section 2. In Section 3, we review some basic concepts of SVMs, and investigate the effects of class imbalance problem on SVMs. Section 4 discusses how to re-balance the data, and Section 5 presents EnSVM and EnSVM+. In Section 6, we describe our benchmark data and report the experimental results, and Section 7 concludes the paper.

Related work

The class imbalance problem has recently attracted considerable attention in the machine learning community. Approaches to addressing this problem can be divided into two main directions: sampling approaches and algorithm-based approaches. Sampling is a popular strategy to handle the skewness, as it simply re-balances the data at the preprocessing stage and can therefore be deployed on top of many existing classification approaches (Chen et al., 2004, Guo and Viktor, 2004, Kubat et al., 1998,

Background

In this section, we first recall some background knowledge about how SVMs function as a classifier; then we demonstrate, with empirical studies, how they act in the context of the class imbalance problem.
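For reference, a standard formulation of the soft-margin SVM that background sections of this kind review is

    \min_{w,\,b,\,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
    \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n,

with decision function f(x) = \mathrm{sign}(w^\top x + b). Under heavy imbalance, the slack penalty C \sum_i \xi_i is dominated by the far more numerous majority-class terms, so the learned hyperplane tends to be pushed toward the minority class; this is consistent with the biased decision boundaries analyzed in this paper.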

Re-balancing the data

We have shown in Section 3.2 that SVMs may perform well when the imbalance ratio is moderate. Nonetheless, their performance can still suffer from extreme data skewness. To cope with this problem, in this section we study the use of sampling techniques to balance the data.

Ensemble of SVMs

Recently, ensemble techniques have been applied in a broad spectrum of scenarios in order to improve the performance of weak classifiers. The basic idea is to first train a collection of learners independently, and then combine the individual output of each learner to obtain the final output. The rationale is that by aggregating results from the individual learners, the noise resulting from bootstrapping at random will be reduced, and the ensemble is expected to be more robust than each of the
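A minimal sketch of such an ensemble follows, assuming binary labels in {+1, -1}, the integrated_sample helper from the earlier sketch, and illustrative choices of kernel and ensemble size; the paper's exact configuration may differ.

    import numpy as np
    from sklearn.svm import SVC

    def train_ensvm(X, y, n_svms=10):
        """Train each component SVM on an independently re-balanced sample;
        the kernel and ensemble size are illustrative choices."""
        models = []
        for k in range(n_svms):
            Xs, ys = integrated_sample(X, y, rng=np.random.default_rng(k))
            models.append(SVC(kernel='rbf', C=1.0).fit(Xs, ys))
        return models

    def predict_ensvm(models, X):
        # Majority vote over {+1, -1} component predictions;
        # ties are broken in favor of the positive (minority) class here.
        votes = np.sum([m.predict(X) for m in models], axis=0)
        return np.where(votes >= 0, 1, -1)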

Empirical evaluation

In this section, we first introduce the evaluation measures used in our study, and then describe the datasets. After that, we report the experimental results that compare our proposed approach with other methods.
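This snippet does not list the measures themselves; for illustration, the measures commonly used in imbalanced-learning studies (precision, recall, F-measure, and the g-mean of the per-class accuracies) can be computed as below. Whether these match the paper's exact choices is an assumption.

    import numpy as np

    def imbalance_metrics(y_true, y_pred, positive=1):
        """Common imbalanced-learning measures (illustrative choice; the
        paper's exact measures are not shown in this snippet)."""
        tp = np.sum((y_pred == positive) & (y_true == positive))
        fp = np.sum((y_pred == positive) & (y_true != positive))
        fn = np.sum((y_pred != positive) & (y_true == positive))
        tn = np.sum((y_pred != positive) & (y_true != positive))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0            # sensitivity
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        g_mean = (recall * specificity) ** 0.5
        return {"precision": precision, "recall": recall,
                "f1": f1, "g_mean": g_mean}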

Conclusions

This paper introduces a new approach to learning from imbalanced datasets that builds an ensemble of SVM classifiers and combines both over-sampling and under-sampling techniques. We first show in this study that using SVMs for class prediction can be influenced by data imbalance, although SVMs can adjust themselves well to some degree of imbalance. To cope with the problem, re-balancing the data is a promising direction, but both under-sampling and over-sampling have limitations. In

Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC 60903108), the Program for New Century Excellent Talents in University (NCET-10-0532), Communications and Information Technology Ontario (CITO), and Discovery Grants from the Natural Sciences and Engineering Research Council of Canada (NSERC).

References (31)

  • Drummond, C., & Holte, R. C. (2003). C4.5, class imbalance, and cost sensitivity: Why under-sampling beats...
  • Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text...
  • Ertekin, S., Huang, J., Bottou, L., & Giles, L. (2007). Learning on the border: Active learning in imbalanced data...
  • Frank, A., & Asuncion, A. (2010). UCI machine learning repository. Irvine, CA: University of California, School of...
  • Fawcett, T., & Provost, F. (1997). Adaptive fraud detection. Data Mining and Knowledge Discovery.