
Open Access 2020 | Original Paper | Book Chapter

Improving Imbalanced Classification by Anomaly Detection

Authors: Jiawen Kong, Wojtek Kowalczyk, Stefan Menzel, Thomas Bäck

Published in: Parallel Problem Solving from Nature – PPSN XVI

Publisher: Springer International Publishing


Abstract

Although the anomaly detection problem can be considered an extreme case of the class imbalance problem, very few studies improve class imbalance classification with anomaly detection ideas. Most data-level approaches in the imbalanced learning domain aim to introduce more information into the original dataset by generating synthetic samples. In this paper, however, we gain additional information in another way, by introducing additional attributes. We propose to introduce the outlier score and the four types of samples (safe, borderline, rare, outlier) as additional attributes in order to gain more information on the data characteristics and improve the classification performance. According to our experimental results, introducing the additional attributes can improve the imbalanced classification performance in most cases (6 out of 7 datasets). A further study shows that this performance improvement is mainly contributed by a more accurate classification in the overlapping region of the two classes (majority and minority). The proposed idea of introducing additional attributes is simple to implement and can be combined with resampling techniques and other algorithmic-level approaches in the imbalanced learning domain.
Notes
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement number 766186 (ECOLE).

1 Introduction

The imbalanced classification problem has caught growing attention from many fields. In the field of computational design optimization, product parameters are modified to generate digital prototypes, and the performance of each prototype is usually evaluated by numerical simulations which often require minutes to hours of computation time. Here, some parameter variations (a minority of the designs) would result in valid and producible geometries but violate given constraints in the final steps of the optimization. Under this circumstance, applying proper imbalanced classification algorithms to the design parameters could save computation time. In the imbalanced learning domain, many techniques have proven to be efficient in handling imbalanced datasets, including resampling techniques and algorithmic-level approaches [5, 9, 15], where the former aim to produce balanced datasets and the latter aim to make classical classification algorithms appropriate for handling imbalanced datasets. Resampling techniques are standard in imbalanced learning since they are simple, easily configurable, and can be used in synergy with other learning algorithms [4]. The main idea of most oversampling approaches is to introduce more information into the original dataset by creating synthetic samples. However, very few studies consider the idea of introducing additional attributes to the imbalanced dataset.
The anomaly detection problem can be considered as an extreme case of the class imbalance problem. In this paper, we propose to improve the imbalanced classification with some anomaly detection techniques. We propose to introduce the outlier score, which is an important indicator to evaluate whether a sample is an outlier [2], as an additional attribute of the original imbalanced datasets. Apart from this, we also introduce the four types of samples (safe, borderline, rare and outlier), which have been emphasized in many studies [14, 16], as another additional attribute. In our experiments, we consider four scenarios, i.e. four different combinations using the additional attributes and performing resampling techniques. The results of our experiments demonstrate that introducing the two proposed additional attributes can improve the imbalanced classification performance in most cases. Further study shows that this performance improvement is mainly contributed by a more accurate classification in the overlapping region of the two classes (majority and minority classes).
The remainder of this paper is organized as follows. In Sect. 2, the research related to our work is presented, also including the relevant background knowledge on four resampling approaches, outlier score and the four types of samples. In Sect. 3, the experimental setup is introduced in order to understand how the results are generated. Section 4 gives the results and further discussion of our experiments. Section 5 concludes the paper and outlines further research.
2 Related Work

As mentioned in the Introduction, we propose to introduce two additional attributes into the imbalanced datasets in order to gain more information on the data characteristics and improve the classification performance. Introducing additional attributes can be regarded as a data preprocessing method, which is independent of resampling techniques and algorithmic-level approaches and can also be combined with both. In this section, the background knowledge related to our experiment is given, including resampling techniques (Sect. 2.1), the definition of the four types of samples in the imbalanced learning domain (Sect. 2.2) and the outlier score (Sect. 2.3).

2.1 Resampling Techniques

In the following, we introduce two oversampling techniques (SMOTE and ADASYN) and two undersampling techniques (NCL and OSS).
Oversampling Techniques. The synthetic minority oversampling technique (SMOTE) is the best-known resampling technique [3]. SMOTE produces synthetic minority samples based on randomly chosen minority samples and their K nearest neighbours: a new synthetic sample is generated by interpolation between the selected minority sample and one of its K nearest neighbours. The main improvement in the adaptive synthetic (ADASYN) sampling technique is that samples which are harder to learn are given higher importance and are oversampled more often [7].
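The SMOTE interpolation step can be sketched in a few lines of plain Python. The function name `smote_sample` and the list-of-lists data layout are our own illustration, not the authors' code; a library implementation such as the one in imbalanced-learn should be preferred in practice.

```python
import random

def smote_sample(minority, k=5, seed=0):
    """Generate one synthetic minority sample by SMOTE-style interpolation.

    minority: list of feature vectors (lists of floats).
    A didactic sketch of the interpolation step only.
    """
    rng = random.Random(seed)
    x = rng.choice(minority)
    # k nearest neighbours of x among the remaining minority samples (Euclidean)
    neighbours = sorted(
        (m for m in minority if m is not x),
        key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)),
    )[:k]
    nn = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    # the synthetic sample lies on the segment between x and its neighbour
    return [a + gap * (b - a) for a, b in zip(x, nn)]
```

Because the new point is an interpolation, it always lies between the two parent samples, which is why SMOTE adds information only inside the minority region.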
Undersampling Techniques. One-Sided Selection (OSS) is an undersampling technique which combines Tomek links and the Condensed Nearest Neighbour (CNN) rule [4, 11]. In OSS, noisy and borderline majority samples are removed via so-called Tomek links [17], and safe majority samples which contribute little to building the decision boundary are then removed with CNN. The Neighbourhood Cleaning Rule (NCL) emphasizes the quality of the retained majority class samples after data cleaning [12]. NCL first removes ambiguous majority samples with Wilson's Edited Nearest Neighbour Rule (ENN) [19], i.e. majority samples whose label disagrees with the majority of their three nearest neighbours. In addition, if a minority sample is misclassified by its three nearest neighbours, the majority samples among those neighbours are removed.
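The Tomek-link step used by OSS can be illustrated with a small sketch (a hypothetical helper of our own, not the paper's code): a Tomek link is a pair of samples from different classes that are each other's nearest neighbour; in OSS, the majority member of each such pair is removed.

```python
def tomek_links(X, y):
    """Return all Tomek links: cross-class pairs that are mutual nearest neighbours.

    X: list of feature vectors, y: list of class labels. O(n^2) didactic sketch.
    """
    def dist(a, b):
        # squared Euclidean distance (monotone, so fine for nearest-neighbour search)
        return sum((u - v) ** 2 for u, v in zip(a, b))

    def nn(i):
        # index of the nearest neighbour of sample i
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: dist(X[i], X[j]))

    return [(i, j) for i in range(len(X)) for j in range(i + 1, len(X))
            if y[i] != y[j] and nn(i) == j and nn(j) == i]
```

For example, with two class-0 points at 0.0 and 1.0 and two class-1 points at 1.1 and 5.0, only the pair (1.0, 1.1) forms a Tomek link, flagging the boundary/noise region that OSS cleans.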

2.2 Four Types of Samples in the Imbalance Learning Domain

Napierala and Stefanowski proposed to analyse the local characteristics of minority examples by dividing them into four different types: safe, borderline, rare examples and outliers [14]. The type of an example can be identified by modelling its k-neighbourhood. Since many applications involve both nominal and continuous attributes, the HVDM metric (Appendix A) is applied to calculate the distance between examples. Given an odd number of neighbours k, a type can be assigned to a minority example through the ratio of the number of its minority-class neighbours to the total number of neighbours (\(R_{\frac{min}{all}}\)) according to Table 1. The type of a majority example can be assigned in a similar way.
Table 1.
Rules to assign the four types of minority examples.
Type            Rule (general k)                                                            Rule for \(k=5\)
Safe (S)        \(\frac{k+1}{2k} < R_{\frac{min}{all}} \leqslant 1\)                        \(\frac{3}{5} < R_{\frac{min}{all}} \leqslant 1\)
Borderline (B)  \(\frac{k-1}{2k} \leqslant R_{\frac{min}{all}} \leqslant \frac{k+1}{2k}\)   \(\frac{2}{5} \leqslant R_{\frac{min}{all}} \leqslant \frac{3}{5}\)
Rare (R)        \(0 < R_{\frac{min}{all}} < \frac{k-1}{2k}\)                                \(0 < R_{\frac{min}{all}} < \frac{2}{5}\)
Outlier (O)     \(R_{\frac{min}{all}} = 0\)                                                 \(R_{\frac{min}{all}} = 0\)
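The rules of Table 1 translate directly into code. The following sketch (our own naming, assuming \(R_{\frac{min}{all}}\) has already been computed from the HVDM-based k-neighbourhood) assigns a type to a minority example:

```python
def sample_type(r_min_all, k=5):
    """Assign a minority-example type from R_min/all, per Table 1.

    r_min_all: fraction of the k neighbours that belong to the minority class.
    """
    if r_min_all == 0:
        return "outlier"                  # no minority neighbour at all
    if r_min_all < (k - 1) / (2 * k):     # below the borderline band
        return "rare"
    if r_min_all <= (k + 1) / (2 * k):    # inside [(k-1)/2k, (k+1)/2k]
        return "borderline"
    return "safe"                         # mostly-minority neighbourhood
```

With k = 5 this reproduces the second row of Table 1: ratios of 1, 3/5, 2/5, 1/5 and 0 map to safe, borderline, borderline, rare and outlier respectively.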

2.3 Outlier Score

Many algorithms have been developed for anomaly detection problems; the experiments in this paper are mainly performed with the nearest-neighbour-based local outlier factor (LOF). The local outlier factor, which indicates the degree to which a sample is an outlier, was first introduced by Breunig et al. in 2000 [2]. The LOF of an object depends on its relative degree of isolation from its surrounding neighbours. Several definitions are needed to calculate the LOF; they are summarized in Algorithm 1.
According to the definition of LOF, a value of approximately 1 indicates that the local density of data point \(X_i\) is similar to that of its neighbours. A value below 1 indicates that data point \(X_i\) lies in a relatively denser area and does not seem to be an anomaly, while a value significantly larger than 1 indicates that data point \(X_i\) is alienated from the other points and is most likely an outlier.
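Since the LOF definitions are short, a compact, didactic plain-Python version can be sketched as follows (our own code, not the paper's; in practice a library detector such as PyOD's `LOF`, which the paper uses, is the better choice):

```python
def lof(points, k=2):
    """Local Outlier Factor (Breunig et al., 2000) for a list of vectors.

    Didactic O(n^2) sketch: returns one LOF score per point.
    """
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    n = len(points)
    # k nearest neighbours (indices) and k-distance of every point
    knn, kdist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist(points[i], points[j]))
        knn.append(order[:k])
        kdist.append(dist(points[i], points[order[k - 1]]))

    def lrd(i):
        # local reachability density: inverse mean reachability distance
        reach = [max(kdist[j], dist(points[i], points[j])) for j in knn[i]]
        return k / sum(reach)

    # LOF: mean ratio of the neighbours' density to the point's own density
    return [sum(lrd(j) for j in knn[i]) / (k * lrd(i)) for i in range(n)]
```

On a tight cluster plus one distant point, the cluster members score close to 1 and the distant point scores far above 1, matching the interpretation given above.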

3 Experimental Setup

In this paper, we propose to introduce the four types of samples and the outlier score as additional attributes of the original imbalanced dataset, where the former can be expressed as \(R_{\frac{min}{all}}\) (Table 1) and the latter can be calculated with the Python library PyOD [20].
The experiments reported in this paper are based on 7 two-class imbalanced datasets: 6 imbalanced benchmark datasets (given in Table 2) and a 2D imbalanced chess dataset, which is commonly used for visualising the effectiveness of selected techniques in the imbalanced learning domain [4]. The imbalance ratio (IR) is the ratio of the number of majority class samples to the number of minority class samples. For each dataset, we consider four scenarios: all combinations of (i) introducing the two additional attributes or not and (ii) performing resampling techniques or not. For each scenario of each dataset, we repeat the experiment 30 times with different random seeds. Paired t-tests were then performed on the 30 values of each performance metric to test whether there is a significant difference between the results of the scenarios at a 5% significance level. Each dataset is divided into 5 stratified folds for cross-validation and only the training set is oversampled; the stratification ensures that the imbalance ratio in the training set is consistent with the original dataset, and oversampling only the training set avoids an over-optimism problem [15]. Our code is available on GitHub (https://github.com/FayKong/PPSN2020) for the convenience of reproducing the main results and figures.
Table 2.
Information on benchmark datasets [1].
Dataset        #Attributes   #Samples   Imbalance ratio (IR)
glass1         9             214        1.82
ecoli4         7             336        15.8
vehicle1       18            846        2.9
yeast4         8             1484       28.1
wine quality   11            1599       29.17
page block     10            5472       8.79
In this paper, we evaluate the performance through several different measures, including Area Under the ROC Curve (AUC), precision, recall, F-Measure (F1) and Geometric mean (Gmean) [13]. These performance measures can be calculated as follows.
$$\begin{aligned} \begin{aligned} AUC&= \frac{1+ TP_{rate} - FP_{rate}}{2}, \quad TP_{rate} = \frac{TP}{TP + FN}, \quad FP_{rate} = \frac{FP}{FP + TN};\\ GM&= \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{FP + TN}}; \quad FM = \frac{(1 + \beta ^2) \times Recall \times Precision}{\beta ^2 \times Precision + Recall};\\ Recall&= TP_{rate} = \frac{TP}{TP + FN}, \quad Precision = \frac{TP}{TP + FP}, \quad \beta = 1; \end{aligned} \end{aligned}$$
where TP, FN, FP and TN indicate True Positives, False Negatives, False Positives and True Negatives in the standard confusion matrix for binary classification.
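These formulas can be checked with a small helper (our own naming; \(\beta = 1\) as in the paper):

```python
def imbalance_metrics(tp, fn, fp, tn):
    """Compute the paper's evaluation measures from a binary confusion matrix."""
    tpr = tp / (tp + fn)            # recall / true-positive rate
    fpr = fp / (fp + tn)            # false-positive rate
    tnr = tn / (fp + tn)            # true-negative rate
    precision = tp / (tp + fp)
    auc = (1 + tpr - fpr) / 2       # AUC of a single classifier point
    gmean = (tpr * tnr) ** 0.5      # geometric mean of the per-class accuracies
    f1 = 2 * tpr * precision / (tpr + precision)  # F-measure with beta = 1
    return {"AUC": auc, "Recall": tpr, "Precision": precision,
            "F1": f1, "Gmean": gmean}
```

Note that when TP + FN = 0 or TP + FP = 0 these ratios are undefined, which is why some cells in Table 3 are reported as "—".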

4 Experimental Results and Discussion

Like other studies [7, 13], we also use SVM and Decision Tree as the base classifiers in our experiments to compare the performance of the proposed method and the existing methods. Please note that we did not tune the hyperparameters for the classification algorithms and the resampling techniques [9]. The experimental results with the two additional attributes (four types of samples and LOF score) are presented in Table 3. We can observe that introducing outlier score and four types of samples as additional attributes can significantly improve the imbalanced classification performance in most cases. For 5 out of 7 datasets (2D chess dataset, glass1, yeast4, wine quality and page block), only introducing additional attributes (with no resampling) gives better results than performing resampling techniques.
Table 3.
Experimental results with SVM and Decision Tree. “Add = YES” means we introduce the two additional attributes to the original datasets; gray cells indicate that the proposed method (Add = YES) significantly outperforms the existing methods (Add = NO); “—” means that TP + FN = 0 or TP + FP = 0 and the performance metric cannot be computed.
https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-58112-1_35/MediaObjects/486037_1_En_35_Tab3_HTML.png
According to our experimental setup, we notice that introducing the outlier score focuses on dealing with the minority samples since the outlier score indicates the degree of a sample being an outlier. Meanwhile, introducing four types of samples (safe, borderline, rare and outlier) puts emphasis on separating the overlapping region and safe region. The visualisation of different scenarios for the 2D chess dataset is given in Fig. 1 in order to further study the reason for the performance improvement.
From both the experimental results in Table 3 and the visualisation in Fig. 1, we can conclude that, for the 2D chess dataset, the experiment with the two additional attributes outperforms the experiment with the classical resampling technique SMOTE. The figure also illustrates that the proposed method has a better ability to handle samples in the overlapping region.
Apart from the visualisation, the feature importance (with Decision Tree) is also analysed in order to gain additional insight into the usefulness of the new attributes. The detailed importance score of each attribute is shown in Table 4. According to the feature importance analysis, the introduced "four types of samples" attribute plays an important role in the decision tree classification process for all datasets in our experiment, and for 3 out of 7 datasets the introduced "outlier score" attribute provides useful information during the classification process. These observations show that the two introduced attributes are actually used in the decision process and that the "four types of samples" attribute is more important than the "outlier score" attribute.
Table 4.
Feature importance analysis with Decision Tree. The higher the “score” is, the more the feature contributes during the classification; “org” indicates the original attribute while “add” indicates the added attribute; grey cells indicate the three most useful attributes (after adding the two proposed attributes) in the decision tree classification process.
https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-58112-1_35/MediaObjects/486037_1_En_35_Tab4_HTML.png

5 Conclusions and Future Research

In this paper, we propose to introduce additional attributes to the original imbalanced datasets in order to improve the classification performance. Two additional attributes, namely four types of samples and outlier score, and the resampling techniques (SMOTE, ADASYN, NCL and OSS) are considered and experimentally tested on seven imbalanced datasets. According to our experimental results, two main conclusions can be derived:
1) In most cases, introducing these two additional attributes can improve the class imbalance classification performance. For some datasets, introducing the additional attributes alone gives better classification results than performing resampling techniques alone.

2) An analysis of the experimental results also illustrates that the proposed method has a better ability to handle samples in the overlapping region.
In this paper, we only validate our idea with four resampling techniques and seven benchmark datasets. As future work, other anomaly detection techniques, such as the clustering-based local outlier score (CBLOF) [8] and histogram-based outlier score (HBOS) [6] could be included in the analysis. Future work could also consider an extension of this research for engineering datasets [10], especially for the design optimization problems mentioned in our Introduction. Detailed analysis of the feature importance and how the proposed method affects the classification performance in the overlapping region would also be worth studying.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Appendices

Appendix A

Heterogeneous Value Difference Metric (HVDM)

HVDM is a heterogeneous distance function that returns the distance between two vectors \(\mathbf {x}\) and \(\mathbf {y}\) [18], where the vectors can involve both nominal and numerical attributes. The HVDM distance can be calculated by [18]:
$$\begin{aligned} HVDM(\mathbf {x}, \mathbf {y}) = \sqrt{\sum ^n_{a=1} {d_a}^2(x_a, y_a)}, \end{aligned}$$
(1)
where n is the number of attributes. The function \(d_a(\cdot )\) returns the distance between \(x_a\) and \(y_a\), where \(x_a\), \(y_a\) indicate the ath attribute of vector x and y respectively. It is defined as follows:
$$\begin{aligned} d_a(x, y) = {\left\{ \begin{array}{ll} 1, &{} \text {if }x\text { or }y\text { is unknown}\\ \text {norm}\_\text {vdm}_a(x, y), &{} \text {if }a\text {th attribute is nominal}\\ \text {norm}\_\text {diff}_a(x, y), &{} \text {if }a\text {th attribute is continuous} \end{array}\right. } \end{aligned}$$
(2)
where
$$\begin{aligned} \text {norm}\_\text {vdm}_a(x, y) = \sqrt{\sum _{c = 1}^C \left| \frac{N_{a,x,c}}{N_{a,x}} - \frac{N_{a,y,c}}{N_{a,y}} \right| ^2}, \quad \text {norm}\_\text {diff}_a(x, y) = \frac{\left| x-y \right| }{4\sigma _a}, \end{aligned}$$
(3)
where
  • C is the number of total output classes,
  • \(N_{a,x,c}\) is the number of instances which have value x for the ath attribute and output class c and \(N_{a,x} = \sum _{c=1}^C N_{a,x,c}\),
  • \(\sigma _a\) is the standard deviation of values of the ath attribute.
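For continuous attributes the HVDM distance of Eqs. (1)-(3) reduces to a Euclidean distance with each attribute scaled by four standard deviations. A minimal sketch (our own helper, covering the continuous branch only; the class-conditional norm_vdm term for nominal attributes is omitted):

```python
def hvdm_continuous(x, y, sigma):
    """HVDM of Eq. (1) for all-continuous vectors: d_a = |x_a - y_a| / (4*sigma_a).

    sigma[a] is the standard deviation of the a-th attribute over the dataset.
    Sketch of the continuous branch of Eq. (2)/(3) only.
    """
    return sum(((a - b) / (4 * s)) ** 2 for a, b, s in zip(x, y, sigma)) ** 0.5
```

The 4-sigma scaling normalizes each continuous attribute so that its contribution is comparable to the nominal norm_vdm term, which is bounded in a similar range.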
References
1. Alcalá-Fdez, J., et al.: KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Multiple-Valued Log. Soft Comput. 17 (2011)
2. Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104 (2000)
3. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
5. Ganganwar, V.: An overview of classification algorithms for imbalanced datasets. Int. J. Emerg. Technol. Adv. Eng. 2(4), 42–47 (2012)
6. Goldstein, M., Dengel, A.: Histogram-based outlier score (HBOS): a fast unsupervised anomaly detection algorithm. KI-2012: Poster and Demo Track, pp. 59–63 (2012)
7. He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322–1328. IEEE (2008)
8. He, Z., Xu, X., Deng, S.: Discovering cluster-based local outliers. Pattern Recogn. Lett. 24(9–10), 1641–1650 (2003)
9. Kong, J., Kowalczyk, W., Nguyen, D.A., Bäck, T., Menzel, S.: Hyperparameter optimisation for improving classification under class imbalance. In: 2019 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 3072–3078. IEEE (2019)
10. Kong, J., Rios, T., Kowalczyk, W., Menzel, S., Bäck, T.: On the performance of oversampling techniques for class imbalance problems. In: Lauw, H.W., Wong, R.C.-W., Ntoulas, A., Lim, E.-P., Ng, S.-K., Pan, S.J. (eds.) PAKDD 2020. LNCS (LNAI), vol. 12085, pp. 84–96. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-47436-2_7
11. Kubat, M., Matwin, S., et al.: Addressing the curse of imbalanced training sets: one-sided selection. In: ICML, Nashville, USA, vol. 97, pp. 179–186 (1997)
13. López, V., Fernández, A., García, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 250, 113–141 (2013)
15. Santos, M.S., Soares, J.P., Abreu, P.H., Araujo, H., Santos, J.: Cross-validation for imbalanced datasets: avoiding overoptimistic and overfitting approaches [research frontier]. IEEE Comput. Intell. Mag. 13(4), 59–76 (2018)
16. Skryjomski, P., Krawczyk, B.: Influence of minority class instance types on SMOTE imbalanced data oversampling. In: First International Workshop on Learning with Imbalanced Domains: Theory and Applications, pp. 7–21 (2017)
17.
18. Wilson, D.R., Martinez, T.R.: Improved heterogeneous distance functions. J. Artif. Intell. Res. 6, 1–34 (1997)
19. Wilson, D.L.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. SMC-2(3), 408–421 (1972)
20.
DOI: https://doi.org/10.1007/978-3-030-58112-1_35