Combining integrated sampling with SVM ensembles for learning from imbalanced datasets
Introduction
A standard two-class classification method usually assumes that the classes to be discriminated have a comparable number of instances. Accordingly, most current classification systems are designed to optimize overall performance rather than to account for the relative distribution of each class. This can be problematic in practice: many real-world datasets are highly skewed, with most cases belonging to a larger class and far fewer cases belonging to a smaller, yet usually more interesting, class. Examples of applications with such datasets include searching for oil spills in radar images (Kubat, Holte, & Matwin, 1998), telephone fraud detection (Fawcett & Provost, 1997), credit card fraud detection (Chan and Stolfo, 1998, Chen et al., 2009), diagnosis of weld flaws (Warren Liao, 2008), and text categorization (Dumais et al., 1998, Ertekin et al., 2007, Stamatatos, 2008). On such data, conventional systems tend to misclassify minority-class examples as majority, leading to a high false negative rate. In these applications, the cost of misclassifying the small (positive) class is usually high. Take network intrusion detection as an example: the number of malicious intrusions to a website is small compared with the millions of regular accesses every day, yet the loss might be huge if an illegitimate access leads to the leakage of private internal data or the disruption of website functionality. To address this problem, two categories of techniques have been proposed: sampling approaches and algorithm-based approaches. Generally, sampling approaches include methods that over-sample the minority class to match the size of the majority class (Guo and Viktor, 2004, Ling and Li, 1998, Solberg and Solberg, 1996), and methods that under-sample the majority class to match the size of the minority class (Chen et al., 2004, Kubat et al., 1998, Kubat and Matwin, 1997, Wilson and Martinez, 2000).
Algorithm-based approaches aim at improving a classifier’s performance based on its inherent characteristics; examples include methods particularly tailored for decision trees, neural networks (MLPs), naive Bayes classifiers, and so on.
This paper is concerned with improving the performance of Support Vector Machines (SVMs) on imbalanced datasets. SVMs have been successful in many applications, such as text mining and handwriting recognition. However, when the data are highly imbalanced, the decision boundary learned from the training data is biased toward the minority class. Most approaches proposed to address this problem have been algorithm-based (Akbani et al., 2004, Veropoulos et al., 1999, Wu and Chang, 2004), attempting to adjust the decision boundary by modifying the decision function, e.g., adapting the kernel function or changing the intercept. We take a complementary approach and study the use of sampling as well as ensemble techniques to improve SVM performance.
First, our observations indicate that using over-sampling alone as proposed in previous work (e.g., SMOTE, as used by Akbani et al., 2004) can introduce excessive noise and lead to ambiguity along decision boundaries. We propose to integrate the two types of sampling strategies: first over-sample the minority class to a moderate extent, then under-sample the majority class to a similar size. This provides the learner with more robust training data. We show empirically that the proposed sampling approach outperforms over-sampling alone irrespective of parameter selection. We further consider using an ensemble of SVMs, which we call EnSVM, to boost performance. A collection of SVMs is trained individually on the processed data, and the final prediction is obtained by combining the results of the individual SVMs. We then show that the generalization capability of EnSVM can be further improved by retaining only a subset of the component SVM classifiers, and propose a new approach, called EnSVM+, that uses genetic algorithms to perform classifier selection.
In summary, we make the following contributions:
- 1. We carefully design a series of experiments to demonstrate that over-sampling alone can distort the decision boundary of an SVM when the data are highly skewed.
- 2. We propose a novel strategy that combines the two types of sampling methods, balancing the class distribution while avoiding the drawbacks of either method used alone.
- 3. We propose the ensemble of SVMs (EnSVM) model, which integrates the classification results of weak classifiers constructed individually on the processed data, and develop a genetic algorithm-based model, EnSVM+, that further boosts classification performance through classifier selection. The effectiveness of the proposed models is confirmed by experiments.
The rest of the paper is organized as follows. We discuss related work in Section 2. In Section 3, we review basic concepts of SVMs and investigate the effects of the class imbalance problem on SVMs. Section 4 discusses how to re-balance the data, and Section 5 presents EnSVM and EnSVM+. In Section 6, we describe our benchmark data and report the experimental results, and Section 7 concludes the paper.
Related work
The class imbalance problem has recently attracted considerable attention in the machine learning community. Approaches to addressing it can be divided into two main directions: sampling approaches and algorithm-based approaches. Sampling is a popular strategy for handling skewness because it simply re-balances the data at the preprocessing stage, and can therefore be deployed on top of many existing classification approaches (Chen et al., 2004, Guo and Viktor, 2004, Kubat et al., 1998, …)
Background
In this section, we first recall background knowledge on how SVMs function as classifiers; we then demonstrate empirically how they behave under the class imbalance problem.
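As a toy illustration of the effect (our own sketch, not the paper's empirical study), one can train a linear SVM on synthetic two-class Gaussian data with a 20:1 imbalance and compare per-class training recall; the scikit-learn usage below is an assumption about tooling, not the paper's setup:

```python
# Toy demo (ours): with 20x more majority examples, a linear SVM's
# boundary is pushed toward the minority class, so minority recall drops.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_maj = rng.normal(loc=-1.0, scale=1.0, size=(500, 2))  # majority (label 0)
X_min = rng.normal(loc=+1.0, scale=1.0, size=(25, 2))   # minority (label 1)
X = np.vstack([X_maj, X_min])
y = np.array([0] * 500 + [1] * 25)

clf = SVC(kernel="linear").fit(X, y)

# Per-class recall on the training data: minority typically suffers.
pred = clf.predict(X)
recall_min = (pred[y == 1] == 1).mean()
recall_maj = (pred[y == 0] == 0).mean()
print(f"majority recall: {recall_maj:.2f}, minority recall: {recall_min:.2f}")
```

On runs of this kind, majority recall is near perfect while minority recall degrades, which is exactly the bias the paper sets out to correct.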
Re-balancing the data
We have shown in Section 3.2 that SVMs may perform well when the imbalance ratio is moderate. Nonetheless, their performance can still suffer from extreme data skewness. To cope with this problem, in this section we study the use of sampling techniques to balance the data.
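A minimal sketch of the integrated strategy outlined in the introduction (over-sample the minority class to a moderate extent, then under-sample the majority class to a similar size). This is our own version, in which random replication stands in for a SMOTE-style synthesizer; the names `integrated_sample` and `oversample_ratio` are ours, not the paper's:

```python
import numpy as np

def integrated_sample(X, y, minority_label=1, oversample_ratio=2.0, seed=0):
    """Over-sample the minority class by `oversample_ratio` (with
    replacement), then under-sample the majority class (without
    replacement) down to the same size, yielding a balanced set."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)

    target = int(len(min_idx) * oversample_ratio)          # moderate extent
    over = rng.choice(min_idx, size=target, replace=True)  # over-sampling
    under = rng.choice(maj_idx, size=min(target, len(maj_idx)),
                       replace=False)                      # under-sampling
    keep = np.concatenate([over, under])
    rng.shuffle(keep)
    return X[keep], y[keep]
```

For example, with 8 majority and 2 minority examples and `oversample_ratio=2.0`, the returned training set contains four examples of each class.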
Ensemble of SVMs
Recently, ensemble techniques have been applied in a broad spectrum of scenarios to improve the performance of weak classifiers. The basic idea is to first train a collection of learners independently, and then combine their individual outputs to obtain the final output. The rationale is that by aggregating results from the individual learners, the noise resulting from random bootstrapping is reduced, and the ensemble is expected to be more robust than each of the individual learners.
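As we read it, the EnSVM idea can be sketched as bagged SVMs combined by majority vote. The following is our own simplified version: scikit-learn's `SVC` and the helper names are assumptions, and plain bootstrapping stands in for the paper's re-sampled training sets:

```python
import numpy as np
from sklearn.svm import SVC

def train_ensemble(X, y, n_members=5, seed=0):
    """Train one SVM per bootstrap replicate of the training set."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        boot = rng.choice(len(X), size=len(X), replace=True)  # bootstrap
        members.append(SVC(kernel="rbf").fit(X[boot], y[boot]))
    return members

def vote(members, X):
    """Combine member predictions by majority vote (binary labels 0/1)."""
    votes = np.stack([m.predict(X) for m in members])  # (n_members, n)
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

The paper's EnSVM+ additionally selects a subset of the trained members (via a genetic algorithm) before voting; that selection step is omitted from this sketch.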
Empirical evaluation
In this section, we first introduce the evaluation measures used in our study, and then describe the datasets. After that, we report the experimental results that compare our proposed approach with other methods.
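For reference, measures commonly used to evaluate imbalanced classification can be computed as below. We are assuming measures of this family (precision, recall, F-measure, and the geometric mean of per-class recalls) rather than reproducing the paper's exact choices; the function name is ours:

```python
import math

def imbalance_metrics(y_true, y_pred, pos=1):
    """Precision/recall on the positive (minority) class, F-measure,
    and G-mean = sqrt(sensitivity * specificity)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == pos and p == pos)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != pos and p == pos)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == pos and p != pos)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != pos and p != pos)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0            # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "F1": f1, "G-mean": math.sqrt(recall * specificity)}
```

Unlike overall accuracy, these measures do not reward a classifier that simply predicts the majority class for every instance.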
Conclusions
This paper introduces a new approach to learning from imbalanced datasets by building an ensemble of SVM classifiers and combining both over-sampling and under-sampling techniques. We first show in this study that class prediction with SVMs can be influenced by data imbalance, although SVMs can adjust themselves to a moderate degree of imbalance. To cope with the problem, re-balancing the data is a promising direction, but both under-sampling and over-sampling have limitations.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (NSFC 60903108), the Program for New Century Excellent Talents in University (NCET-10-0532), Communications and Information Technology Ontario (CITO), and Discovery Grants from the Natural Sciences and Engineering Research Council of Canada (NSERC).
References (31)
- Akbani, R., Kwek, S., & Japkowicz, N. (2004). Applying support vector machines to imbalanced datasets. In ECML (pp. …).
- Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery.
- Chan, P. K., & Stolfo, S. J. (1998). Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection. Knowledge Discovery and Data Mining.
- Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research.
- Chawla, N. V., Lazarevic, A., Hall, L. O., & Bowyer, K. W. (2003). SMOTEBoost: Improving prediction of the minority …
- Chen et al. (2004). Using random forest to learn imbalanced data.
- Chen et al. (2009). Mining the customer credit using hybrid support vector machine technique. Expert Systems with Applications.
- Fawcett, T., & Provost, F. (1997). Adaptive fraud detection. Data Mining and Knowledge Discovery.
- Stamatatos, E. (2008). Author identification: Using text sampling to handle the class imbalance problem. Information Processing & Management.
- et al. (2009). On strategies for imbalanced text classification using SVM: A comparative study. Decision Support Systems.
- Zhou, Z.-H., et al. (2002). Ensembling neural networks: Many could be better than all. Artificial Intelligence.