1 Introduction
2 Related Work
2.1 Resampling Techniques
2.2 Four Types of Samples in the Imbalance Learning Domain
Type | Safe (S) | Borderline (B) | Rare (R) | Outlier (O) |
---|---|---|---|---|
Rule | \(\frac{k+1}{2k} < R_{\frac{min}{all}} \leqslant 1 \) | \(\frac{k-1}{2k} \leqslant R_{\frac{min}{all}} \leqslant \frac{k+1}{2k}\) | \(0< R_{\frac{min}{all}} < \frac{k-1}{2k}\) | \(R_{\frac{min}{all}} = 0\) |
Rule (e.g., for a neighbourhood of fixed size \(k=5\)) | \(\frac{3}{5} < R_{\frac{min}{all}} \leqslant 1 \) | \(\frac{2}{5} \leqslant R_{\frac{min}{all}} \leqslant \frac{3}{5}\) | \(0< R_{\frac{min}{all}} < \frac{2}{5}\) | \(R_{\frac{min}{all}} = 0\) |
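For illustration, the assignment rule above can be sketched in a few lines of Python. This is a minimal brute-force sketch; the function names and the toy k-NN computation are ours, not from the paper:

```python
import numpy as np

def sample_type(ratio, k=5):
    """Map R_min/all (the fraction of same-class neighbours among the
    k nearest neighbours) to one of the four sample types."""
    if ratio == 0:
        return "Outlier"                   # R = 0
    if ratio < (k - 1) / (2 * k):
        return "Rare"                      # 0 < R < (k-1)/2k
    if ratio <= (k + 1) / (2 * k):
        return "Borderline"                # (k-1)/2k <= R <= (k+1)/2k
    return "Safe"                          # (k+1)/2k < R <= 1

def minority_ratios(X, y, minority_label, k=5):
    """For each minority sample, compute the fraction of its k nearest
    neighbours (excluding itself) that also belong to the minority class."""
    X = np.asarray(X, dtype=float)
    ratios = {}
    for i in np.where(y == minority_label)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # a sample is not its own neighbour
        nn = np.argsort(d)[:k]
        ratios[i] = float(np.mean(y[nn] == minority_label))
    return ratios
```

With \(k=5\) this reproduces the thresholds \(\frac{2}{5}\) and \(\frac{3}{5}\) shown in the second row of the table.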
2.3 Outlier Score
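The experiments compute the outlier score with the PyOD library [20]; as a self-contained illustration of the standard LOF (Local Outlier Factor) definition used as this score, a brute-force sketch follows. This is our own minimal implementation, not the paper's code:

```python
import numpy as np

def lof_scores(X, k=5):
    """Minimal brute-force LOF: scores near 1 indicate inliers,
    scores clearly above 1 indicate outliers."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)           # never count a point as its own neighbour
    nn = np.argsort(D, axis=1)[:, :k]     # indices of the k nearest neighbours
    k_dist = D[np.arange(n), nn[:, -1]]   # distance to the k-th neighbour
    # local reachability density: inverse of the mean reachability distance
    lrd = np.empty(n)
    for i in range(n):
        reach = np.maximum(k_dist[nn[i]], D[i, nn[i]])
        lrd[i] = 1.0 / np.mean(reach)
    # LOF: how much sparser a point's neighbourhood is than its neighbours'
    return np.array([np.mean(lrd[nn[i]]) / lrd[i] for i in range(n)])
```

A point far from a dense cluster receives a score much larger than 1, which is what makes the score informative as an additional attribute.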
3 Experimental Setup
We use the PyOD library [20] to compute the outlier scores.

Dataset | #Attributes | #Samples | Imbalance ratio (IR) |
---|---|---|---|
glass1 | 9 | 214 | 1.82 |
ecoli4 | 7 | 336 | 15.8 |
vehicle1 | 18 | 846 | 2.9 |
yeast4 | 8 | 1484 | 28.1 |
wine quality | 11 | 1599 | 29.17 |
page block | 10 | 5472 | 8.79 |
4 Experimental Results and Discussion
We use SVM and Decision Tree as the base classifiers in our experiments to compare the performance of the proposed method with that of the existing methods. Please note that we did not tune the hyperparameters of the classification algorithms or the resampling techniques [9]. The experimental results with the two additional attributes (four types of samples and LOF score) are presented in Table 3. We can observe that introducing the outlier score and the four types of samples as additional attributes significantly improves the imbalanced classification performance in most cases. For 5 out of 7 datasets (2D chess, glass1, yeast4, wine quality and page block), introducing the additional attributes alone (with no resampling) already gives better results than the resampling techniques.

Table 3. Experimental results with SVM and Decision Tree. “Add = YES” means that the two additional attributes are introduced into the original datasets; gray cells indicate that the proposed method (Add = YES) significantly outperforms the existing methods (Add = NO); “—” means that TP + FN = 0 or TP + FP = 0, so the performance metric cannot be computed.

The feature importance of the Decision Tree classifier is also analysed in order to gain additional insight into the usefulness of the new attributes. The detailed importance score of each attribute is shown in Table 4. According to this analysis, the introduced “four types of samples” attribute plays an important role in the decision tree classification process for all datasets in our experiment, and for 3 out of 7 datasets the introduced “outlier score” attribute provides useful information during classification. These observations show that the two introduced attributes are actually used in the decision process, and that the “four types of samples” attribute is more important than the “outlier score” attribute.

Table 4. Feature importance of the Decision Tree classifier. The higher the score, the more the feature contributes to the classification; “org” indicates an original attribute, while “add” indicates an added attribute; grey cells indicate the three most useful attributes (after adding the two proposed attributes) in the decision tree classification process.
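The augmentation-and-importance analysis can be sketched as follows. This is a hedged illustration, not the paper's pipeline: the data are synthetic, and the `sample_type` and `lof_score` columns are random placeholders standing in for the real “four types of samples” encoding and LOF score from Sect. 2:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# toy imbalanced dataset: 90 majority vs. 10 minority samples, 4 original attributes
X_maj = rng.normal(0.0, 1.0, size=(90, 4))
X_min = rng.normal(2.0, 1.0, size=(10, 4))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 90 + [1] * 10)

# the two added attributes (placeholder values; in the paper these are the
# "four types of samples" label and the LOF outlier score)
sample_type = rng.integers(0, 4, size=100)   # 0=Safe, 1=Borderline, 2=Rare, 3=Outlier
lof_score = rng.normal(1.0, 0.2, size=100)
X_aug = np.column_stack([X, sample_type, lof_score])  # 4 "org" + 2 "add" columns

# fit a decision tree on the augmented data and read off one importance
# score per attribute (scores are non-negative and sum to 1)
clf = DecisionTreeClassifier(random_state=0).fit(X_aug, y)
importances = clf.feature_importances_
```

Comparing the importance of the last two columns against the original ones is exactly the kind of evidence summarised in Table 4.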