Published in: Neural Computing and Applications 7/2021

Open Access 18.07.2020 | Original Article

CDSMOTE: class decomposition and synthetic minority class oversampling technique for imbalanced-data classification

Authors: Eyad Elyan, Carlos Francisco Moreno-Garcia, Chrisina Jayne

Abstract

Class-imbalanced datasets are common across several domains such as health, banking, security and others. The dominance of majority class instances (negative class) often results in biased learning models, and therefore, classifying such datasets requires employing methods to counter the problem. In this paper, we propose a new hybrid approach that reduces the dominance of the majority class instances using class decomposition and increases the minority class instances using an oversampling method. Unlike undersampling methods, which suffer from data loss, our method preserves all majority class instances, yet significantly reduces their dominance, resulting in a more balanced dataset and hence improved results. A large-scale experiment using 60 public datasets was carried out to validate the proposed method. The results across three standard evaluation metrics are comparable with, and often superior to, other common and state-of-the-art techniques.

1 Introduction

Class-imbalance classification is a long-standing problem in the literature [1–5], where a binary dataset contains a disproportionately larger number of samples of the majority class (i.e., the negative class) [6]. Such datasets are common in many domains including life sciences [7], protein classification [8], DNA sequence recognition [9], the financial sector [10], the medical domain [11], medicine rating and recommendation [12], engineering drawings analysis [13–15] and others. An example of a binary classification problem is shown in Eq. 1. In a classification task, the aim is to learn a function \(\textit{h(x)}\) that maps an instance \(\mathbf{x} _{i} \in X\) to a class \(\mathbf{y} _{i}\), where \(y_i \in Y=\{C_N,C_P\}\), denoting the negative and positive class, respectively. In an imbalanced dataset, the positive class \(C_P\) (the class of interest) is often underrepresented, causing a learning algorithm to be biased toward the majority class instances \(C_N\).
$$\begin{aligned} X = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1n} \\ x_{21} & x_{22} & \dots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \dots & x_{mn} \end{bmatrix} ,\quad Y=\begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix} \end{aligned}$$
(1)
Consider, in the banking sector, a dataset for detecting fraudulent transactions. Most transactions are legitimate (i.e., 90–99%, \(x_i \in C_N\)), and few are fraudulent (class \(C_P\) in Eq. 1). In such a scenario, an accuracy of more than 90% can easily be obtained. However, it is also easy to misclassify the class of interest (i.e., \(x_i \in C_P\)), hence the need for solutions that account for the data distribution. Solutions for handling this problem can be broadly categorized as data-based, algorithmic-based or cost-sensitive [2]. Data-based solutions are commonly used to counter class imbalance. These methods focus on either undersampling the data to reduce the dominance of the majority class instances, oversampling the minority class, or a hybrid approach that combines both. Algorithmic-based solutions tend to modify the learning algorithms themselves. Such algorithms include C4.5, k nearest neighbors (k-NN), support vector machines (SVM) and others.
Unlike most learning models, which assign the same cost to all misclassifications in the learning process, cost-sensitive methods assign costs based on the actual class and aim at minimizing the total cost [16]. These methods emphasize the class of interest (positive class) by assigning higher costs to misclassifying it.
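As an illustration of this idea, the following sketch (not taken from the paper) uses scikit-learn's per-class weights to make misclassifying the positive class more expensive; the 10:1 cost ratio is an arbitrary example, not a value used by the authors.

```python
# Cost-sensitive learning via class weights: errors on the positive (minority)
# class are penalised ten times more heavily than errors on the negative class.
from sklearn.svm import SVC

# The weight values are illustrative only and would normally be tuned or taken
# from domain knowledge (e.g. the monetary cost of a missed fraud case).
clf = SVC(kernel="linear", class_weight={0: 1, 1: 10})
# clf.fit(X_train, y_train) then trains with the weighted hinge loss.
```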
In this paper, we propose a new method for handling the class-imbalance problem based on class decomposition (CD) of the majority class and synthetic oversampling of the minority class. For short, we will refer to the proposed method as CDSMOTE. Our method first finds the similarities within the majority class instances and groups them accordingly. This reduces the dominance of the majority class without causing information loss, as is the case with other undersampling techniques. To ensure a balanced distribution of the data, we then apply an oversampling method to improve the representation of the minority class. Extensive experiments were carried out, and the results show that the proposed method improves different metrics when compared with common and state-of-the-art methods.
The rest of the paper is organized as follows. Section 2 provides the necessary background and relevant literature. In Sect. 3, we present our method. Section 4 details the experiments with thorough evaluation and discussion of the results. Finally, Sect. 5 concludes the work and discusses possible future directions.
2 Related work

Handling class-imbalance is most commonly achieved using either data-based [17, 18] or algorithmic-based solutions [19]. Given the purpose and scope of this paper, we focus on data-based methods; for algorithmic-based solutions, we refer the reader to a recent survey [20] for more details. Data resampling is one of the standard approaches to classifying class-imbalanced datasets. These methods include undersampling, which aims at reducing the dominance of the majority class instances, oversampling, which aims at increasing the visibility of the minority class instances, and hybrid approaches combining both.
Random sampling is one of the most basic methods used to handle the class-imbalance problem. It can be applied as random undersampling (RUS), which rebalances a dataset by randomly sampling a subset of the majority class instances, or as random oversampling (ROS), which duplicates instances of the minority class. The approach is simple, but it is very likely to result in losing data (RUS) or overfitting (ROS). Therefore, these methods are rarely used alone. For example, in [21] a hybrid of RUS and a boosting algorithm (RUSBoost) was implemented to improve the classification results.
Other methods focus on undersampling data from the overlapping region, aiming at minimizing the overlap between positive and negative instances. Figure 1a shows a typical example of a hugely imbalanced dataset, with the overlapping region highlighted in Fig. 1b, while Fig. 1c shows a possible solution where undersampling is carried out within that region. Several techniques are available to facilitate undersampling from the overlapping region. Among these, the Tomek Link (T-Link) [22] is a popular concept, originally proposed to edit the nearest-neighbor rule, that is used to remove instances from the overlapping region. The main idea is simple: given a dataset Z, two samples (a from the majority class and b from the minority class) and a distance function dist between them, a T-Link between a and b is obtained if, for every example \(z\in Z\):
$$\begin{aligned} \text{dist}(z,a)>\text{dist}(a,b) \wedge \text{dist}(z,b)>\text{dist}(a,b) \end{aligned}$$
(2)
The basic idea is then to discard sample a from the dataset whenever a T-Link is obtained. This method proved useful in handling class-imbalance and provides a better alternative to random sampling. For example, Kubat and Matwin [23] proposed an undersampling method that shrinks the overlapping region using T-Links [22]. This was achieved by selectively removing redundant majority class instances close to the class boundary, and better performance was reported on real datasets. Devi et al. [24] proposed a more recent T-Link-based method that also aims at removing noisy and redundant negative instances from the overlapping region. Other similar approaches include the neighboring cleaning rule (NCL) for small sets [25] and the majority undersampling technique (MUTE) [26]. Removing negative class instances selectively (i.e., from the overlapping region) often yields better results. However, it does not prevent data loss, which might affect the overall accuracy. Therefore, in scenarios or application domains where the overall accuracy matters, alternatives should be considered to minimize the risk of losing information [2]. A recent work by Vuttipittayamongkol and Elyan [6] followed a similar approach, selectively removing negative instances from the overlapping region using fuzzy C-means, and reported results comparable with the state of the art. More recently, the authors extended this work by introducing a novel way to detect and remove negative instances from the overlapping region using neighborhood searching techniques, again reporting results comparable with state-of-the-art methods [27].
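For concreteness, the snippet below shows overlap-oriented undersampling with Tomek links using the imbalanced-learn library and toy data; it is a generic illustration of the concept rather than the specific methods cited above.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks

# A toy imbalanced dataset (roughly 90% negative, 10% positive).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# By default, TomekLinks removes the non-minority member of each Tomek link,
# i.e. boundary majority-class samples in the overlapping region.
X_res, y_res = TomekLinks().fit_resample(X, y)
print(Counter(y), Counter(y_res))
```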
Oversampling methods, which aim at improving the presence of the minority class instances, are also common practice. The synthetic minority oversampling technique (SMOTE) proposed by Chawla et al. [28] is still widely used in this domain. SMOTE generates synthetic data points using a neighborhood-based technique (i.e., k-NN). Several extensions of SMOTE have been proposed since its introduction, including SMOTEBoost [29], Borderline-SMOTE [17], DBSMOTE [30], MWMOTE [31] and others. ADASYN [32] is another widely used oversampling method, which assigns a higher weight to harder-to-learn samples (samples in the overlapping region) using k-NN.
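Both techniques are available in imbalanced-learn; the short example below (with toy data, not the paper's datasets) oversamples a minority class with SMOTE and ADASYN using 4 neighbors, matching the setting used later in the experiments.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, ADASYN

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# SMOTE interpolates between a minority sample and one of its k nearest
# minority neighbours; ADASYN additionally focuses on harder-to-learn samples.
X_sm, y_sm = SMOTE(k_neighbors=4, random_state=0).fit_resample(X, y)
X_ad, y_ad = ADASYN(n_neighbors=4, random_state=0).fit_resample(X, y)
print(Counter(y), Counter(y_sm), Counter(y_ad))
```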
Clustering-based methods are common across different domains [33] and are widely used for undersampling data. A clustering method such as k-means or fuzzy C-means (FCM) [34] is applied to cluster the majority class instances into k clusters. Data are then sampled from each cluster, aiming at a smaller yet representative sample, so that a more balanced dataset is obtained. Bunkhumpornpat et al. [35] proposed a majority class undersampling technique based on a density-based spatial clustering algorithm (DBMUTE), designed to eliminate negative instances from the overlapping region. Lin et al. [36] presented another clustering-based undersampling method where the negative instances are first clustered, with the number of clusters set equal to the number of data points in the minority class. Undersampling is then carried out using the cluster centers or the clusters' nearest neighbors, respectively. An experiment using 44 public datasets showed competitive results.
Clustering-based methods have also been used to handle the minority class. For example, Yong et al. [37] used k-means to divide the minority class into smaller clusters, and a genetic algorithm was then used to generate new samples based on those clusters. This technique, however, is not applicable when the number of minority class instances is minimal. Similarly, Seoane Santos et al. [38] handled patient data by clustering the minority class instances and then rebalancing the data using SMOTE. Puntumapon et al. [39] proposed a new method called TRIM as a preprocessing stage before applying oversampling methods such as SMOTE or one of its extensions. Lim et al. [40] implemented an evolutionary ensemble learning framework that clusters the minority class instances using mini-batch k-means and hierarchical agglomerative clustering before generating synthetic samples.
Overall, it can be said that clustering the minority class instances before applying oversampling contributed to improving the results; however, such methods require enough samples from the minority class before they can be applied. More recently, generative adversarial networks (GANs) have been applied successfully to handle class-imbalance by synthesizing new samples of the minority class. Typical examples were presented in [14, 41, 42], where data augmentation approaches using variants of GANs were employed to handle the class-imbalance problem. Using image-based datasets, these methods showed favorable performance over traditional sampling techniques.

3 Methods

The method presented in this paper first reduces the dominance of the majority class instances in the dataset by applying an unsupervised learning algorithm to group them into subclasses. An oversampling technique is then used to improve the presence of the minority class instances. Algorithm 1 provides a schematic overview of the proposed method: any dataset A is first transformed into a decomposed dataset \(A_c\) (Sect. 3.1); the minority class is then oversampled, subject to reassessing the decomposed dataset \(A_c\) (Sect. 3.2). If oversampling is applied, a dataset \(A_{cm}\) is created as the result of class decomposition and oversampling combined. Finally, a learning algorithm is applied to the resulting dataset.
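Since Algorithm 1 itself is not reproduced here, the following high-level sketch gives one possible reading of the pipeline in Python; class_decompose and oversample_minority are our own helper names (sketched in Sects. 3.1 and 3.2 below), not the authors' code.

```python
def cdsmote(X, y, k=2, smote_k=4):
    """One possible reading of Algorithm 1: decompose, then (maybe) oversample."""
    # Step 1: decompose the majority class into k subclasses -> dataset A_c.
    y_dec = class_decompose(X, y, k=k)
    # Step 2: oversample the minority class only if it falls below the average
    # subclass size after decomposition -> dataset A_cm.
    X_cm, y_cm = oversample_minority(X, y_dec, k_neighbors=smote_k)
    # Step 3: a learning algorithm (a linear SVM in the experiments) is then
    # trained on (X_cm, y_cm).
    return X_cm, y_cm
```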

3.1 Class decomposition

Class decomposition is achieved by applying a clustering algorithm to a training set and aims at minimizing the bias-variance trade-off [43] by creating more local boundaries within the dataset. The method was presented by Vilalta et al. [44], where experiments using 20 datasets showed an improvement in the performance of Naive Bayes and SVM. In [45], Polaka used hierarchical and k-means clustering to decompose the majority class instances and reported an improvement in RF performance. More recently, in [7], Elyan and Gaber extended this approach by applying class decomposition to all classes in the dataset, with the k value (number of clusters) set experimentally; the results showed significant improvement. Later, the authors [46] showed that RF performance with decomposition could be optimized using a genetic algorithm. More recently, CD was applied to a set of engineering symbols extracted from engineering drawings, where it significantly improved the performance of SVM, RF and convolutional neural networks (CNN) [15].
In this paper, we follow an approach similar to [7, 46] by applying the k-means clustering algorithm to the majority class instances. By decomposing the majority class into k subclasses, we aim to achieve two goals: first, to reduce the dominance of the majority class instances; and second, to avoid the loss of information that often results from applying other undersampling methods. This is illustrated in Fig. 2.
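A minimal sketch of this decomposition step is shown below, assuming a NumPy feature matrix X, binary labels y with the majority (negative) class labelled 0, and k-means with k clusters; the label names (P, N_C1, ...) are illustrative and mirror Fig. 2 rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def class_decompose(X, y, majority_label=0, k=2):
    """Relabel the majority class into k subclasses N_C1..N_Ck.

    The minority class keeps a single label 'P'; no instance is removed.
    """
    y_dec = np.where(y == majority_label, "", "P").astype(object)
    mask = y == majority_label
    clusters = KMeans(n_clusters=k, random_state=0).fit_predict(X[mask])
    y_dec[mask] = [f"N_C{c + 1}" for c in clusters]
    return y_dec
```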
Figure 2 (left) shows the original dataset with the minority class instances (P), while the right side shows the dataset after applying class decomposition, which resulted in the same dataset but with different subclasses (clusters) representing the majority class instances (N) as \(N\_C1, N\_C2, \ldots \). Notice that with such an approach, we transform the dataset into different distributions and at the same time preserve all information. Consider the binary classification task in Eq. 3, where we want to learn h(x) that maps each instance \(x_i\) to a class \(y_i \in \{C_N,C_P\}\).
$$\begin{aligned} h(X):X \rightarrow Y \end{aligned}$$
(3)
Notice that in a classification task such as the one in Eq. 3, we aim to minimize the number of misclassifications, as shown in Eq. 4
$$\begin{aligned} \min \left( \sum _{i=1}^{m} (y_{i} \ne {\hat{y}}_{i})\right) \end{aligned}$$
(4)
where \(y_i\) is the actual class label, \({\hat{y}}_{i}\) is the predicted class label and m is the number of instances in the dataset. When we apply class decomposition to the dataset X in Eq. 3, we get a new classification task (Eq. 5).
$$\begin{aligned} h^{\prime }(X):X \rightarrow Y^{\prime } \end{aligned}$$
(5)
Here, we want to learn a function \(h^{\prime }(x)\) that maps each instance \(x_i\) to the corresponding label \(y^{\prime }_i \in \{C_{N1},C_{N2},\ldots C_{NK},C_P\}\), where K denotes the number of clusters. Notice that with such an approach, we transform a binary classification problem into a multiclass classification problem. Transforming the data in this way not only reduces the dominance of the negative class \(C_N\) by clustering it into K subclasses, but also allows the learning algorithm to be trained at a fine-grained level. The same objective function in Eq. 4 holds, but with a minor change: a prediction is considered correct as long as it falls within the main class of labels. In other words, for any negative instance \(x_i \rightarrow C_N\), a predicted label \({\hat{y}}_i\) is considered correct if and only if \({\hat{y}}_i \in \{C_{N1},C_{N2},\ldots C_{NK}\}\).
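In practice this evaluation rule simply maps every subclass prediction back to the original negative label before computing binary metrics, as in the small helper below (label names follow the sketches in this section, not the authors' code).

```python
import numpy as np

def to_binary(labels, positive="P"):
    """Map decomposed labels back to the original binary problem:
    'P' -> 1 (positive); any majority subclass N_C1..N_CK -> 0 (negative)."""
    return np.where(np.asarray(labels, dtype=str) == positive, 1, 0)

# Example: to_binary(["N_C1", "P", "N_C2"]) returns array([0, 1, 0]).
```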

3.2 Minority class oversampling

Applying CD to a dataset results in a different data distribution. In other words, new minority/majority subclasses may appear (from within the clusters of the majority class instances). So, first, we check whether the number of samples in the minority class is close to the average number of samples of the majority subclasses. For instance, Fig. 2 shows the minority class falling below the average size of the five subclasses obtained after class decomposition (the horizontal red line in Fig. 2 represents this average). In this case, oversampling is applied to the minority class. To oversample, we chose SMOTE [28] due to its efficiency and popularity as one of the most common oversampling methods. SMOTE requires two classes as input (a minority and a majority) to perform the oversampling of the minority class, using the majority class samples as a reference for synthetic sample generation. In this paper, we use the majority subclass whose number of samples is closest to the mean as the majority class input to SMOTE; in Fig. 2, this would be \(N\_C3\). In cases of a tie (i.e., more than one candidate majority subclass), one is selected at random. It has to be noted that these simple heuristics were chosen empirically when implementing CDSMOTE: oversampling the minority class whenever it falls below the average subclass size yielded better results overall.
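The sketch below captures this heuristic as we read it (it is not the authors' implementation): the minority class is oversampled with SMOTE only when it is smaller than the average subclass size, and the majority subclass whose size is closest to that average serves as SMOTE's reference class. Label conventions follow the class_decompose sketch above.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

def oversample_minority(X, y_dec, minority_label="P", k_neighbors=4):
    labels, counts = np.unique(y_dec, return_counts=True)
    sizes = dict(zip(labels, counts))
    sub_sizes = {l: c for l, c in sizes.items() if l != minority_label}
    avg = float(np.mean(list(sub_sizes.values())))
    if sizes[minority_label] >= avg:
        return X, y_dec  # minority already at or above the average subclass size

    # Reference subclass: the one closest in size to the average (ties are
    # broken deterministically here; randomly in the paper).
    ref = min(sub_sizes, key=lambda l: abs(sub_sizes[l] - avg))
    keep = np.isin(y_dec, [ref, minority_label])
    sm = SMOTE(k_neighbors=k_neighbors, random_state=0)
    X_pair, y_pair = sm.fit_resample(X[keep], y_dec[keep].astype(str))

    # imbalanced-learn appends the synthetic samples after the originals, and
    # they all belong to the minority class being oversampled.
    synth = X_pair[int(keep.sum()):]
    return np.vstack([X, synth]), np.concatenate([y_dec, [minority_label] * len(synth)])
```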

4 Experiments

A large-scale experiment was carried out to compare CDSMOTE with other common methods for handling class-imbalanced data classification. In this experiment, CDSMOTE is compared against SMOTE [28] and ADASYN [32], chosen because they are among the most common oversampling methods in the literature. Moreover, CDSMOTE is compared against class decomposition [7] and against recent state-of-the-art methods, including [36, 47, 48]. The following subsections describe the experiment in detail.

4.1 Datasets

A collection of 60 datasets was used in this experiment. These are publicly available and commonly used in class-imbalanced data classification (e.g., [36, 47, 48]). The datasets were obtained from the KEEL repository.1 As can be seen in Table 1, these are binary classification datasets with different imbalance ratios, numbers of instances and numbers of features.
Table 1
Datasets

Dataset | No. samples | No. features | Imbalance ratio
Glass1 | 214 | 9 | 1.82
Wisconsin | 683 | 9 | 1.86
Ecoli0_vs_1 | 220 | 7 | 1.86
Pima | 768 | 8 | 1.87
Iris0 | 150 | 4 | 2
Glass0 | 214 | 9 | 2.06
Yeast1 | 1484 | 8 | 2.46
Haberman | 306 | 3 | 2.78
Vehicle2 | 846 | 18 | 2.88
Vehicle1 | 846 | 18 | 2.9
Vehicle3 | 846 | 18 | 2.99
Glass0123_vs_456 | 214 | 9 | 3.2
Vehicle0 | 846 | 18 | 3.25
Ecoli1 | 336 | 7 | 3.36
New-thyroid2 | 215 | 5 | 4.92
New-thyroid1 | 215 | 5 | 5.14
Ecoli2 | 336 | 7 | 5.46
Segment0 | 2308 | 19 | 6.02
Glass6 | 214 | 9 | 6.38
Yeast3 | 1484 | 8 | 8.1
Ecoli3 | 336 | 7 | 8.6
Page-blocks0 | 5472 | 10 | 8.79
Ecoli034_vs_5 | 200 | 7 | 9
Yeast2_vs_4 | 514 | 8 | 9.08
Ecoli067_vs_35 | 220 | 6 | 9.09
Ecoli0234_vs_5 | 202 | 7 | 9.1
Yeast0359_vs_78 | 506 | 8 | 9.12
Yeast0256_vs_3789 | 1004 | 8 | 9.14
Ecoil046_vs_5 | 203 | 6 | 9.15
Ecoli0346_vs_5 | 205 | 7 | 9.25
Ecoli0347_vs_56 | 257 | 7 | 9.28
Yeast05679_vs_4 | 528 | 8 | 9.35
Vowel0 | 988 | 13 | 9.98
Ecoli067_vs_5 | 220 | 6 | 10
Glass016_vs_2 | 192 | 9 | 10.29
Led7digit02456789_vs_1 | 443 | 7 | 10.97
Ecoli01_vs_5 | 240 | 6 | 11
Glass2 | 214 | 9 | 11.59
Ecoli0147_vs_56 | 332 | 6 | 12.28
Shuttle0_vs_4 | 1829 | 9 | 13.87
Yeast1_vs_7 | 459 | 7 | 14.3
Glass4 | 214 | 9 | 15.47
Ecoli4 | 336 | 7 | 15.8
Page-blocks13_vs_2 | 472 | 10 | 15.85
Abalone9-18 | 731 | 8 | 16.4
Dermatology6 | 358 | 34 | 16.87
Glass016_vs_5 | 184 | 9 | 19.44
Shuttle2_vs_4 | 129 | 9 | 20.5
Yeast1458_vs_7 | 693 | 8 | 22.1
Glass5 | 214 | 9 | 22.78
Yeast2_vs_8 | 482 | 8 | 23.1
Yeast4 | 1484 | 8 | 28.1
Winequality-red4 | 1599 | 11 | 29.17
Yeast1289_vs_7 | 947 | 8 | 30.57
Yeast5 | 1484 | 8 | 32.73
Ecoli0137_vs_26 | 281 | 7 | 39.14
Yeast6 | 1484 | 8 | 41.4
Winequality-red8_vs_67 | 855 | 11 | 46.5
Winequality-white39_vs_5 | 1482 | 11 | 58.28
Abalone19 | 731 | 8 | 129.44

4.2 Settings and implementation details

All datasets were partitioned into training and testing sets with a ratio of 80% and 20%, respectively, and fivefold cross-validation was used during training. In all experiments, SVM with a linear kernel was used as the learning algorithm. Other learning algorithms could have been considered, for example RF, which showed favorable results over other state-of-the-art methods such as boosting and SVM [49]. However, RF has already shown favorable accuracy when class decomposition was applied to the dataset, as discussed in [7, 46]. In this work, we chose SVM with a linear kernel and default settings to establish the impact of class decomposition on class-imbalanced dataset classification.
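The protocol just described can be reproduced roughly as follows; the synthetic dataset stands in for a KEEL dataset and the stratified split is our assumption, while the rest mirrors the stated settings (80%/20% split, fivefold cross-validation, linear-kernel SVM with defaults).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Stand-in for loading one of the KEEL datasets (loading code not shown here).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = SVC(kernel="linear")                                # default settings
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)  # fivefold CV
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```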
Table 2 shows that each dataset was processed using SMOTE, ADASYN, CD, CDSMOTE and, finally, the baseline, where no undersampling or oversampling was applied. For SMOTE and ADASYN, the number of nearest neighbors was set to 4 (\(k=4\)), and for class decomposition, we used k-means with \(k=2\). It is worth pointing out that these parameters were held fixed throughout, and no parameter tuning was carried out, to ensure a fair comparison between methods and to assess the impact of CDSMOTE on learning from imbalanced datasets using three different evaluation metrics. First, we evaluate the results using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, which is a plot of the sensitivity or true positive rate (TPR) as a function of the false positive rate (FPR). The second evaluation metric we used is the geometric mean (Gmean), which measures the balance between the TPR and the true negative rate (TNR) and is defined as \(\sqrt{\mathrm{TPR} \times \mathrm{TNR}}\). Finally, we used the \(F_1\) Score [35], which balances precision and recall (TPR) and is defined as \(F_1\) Score = \(2 \times \frac{\mathrm{precision} \times \mathrm{TPR}}{\mathrm{precision} + \mathrm{TPR}}\).
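The three metrics can be computed with scikit-learn and NumPy as sketched below; the metric definitions follow the text above, while the code itself is our illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

def gmean(y_true, y_pred):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return np.sqrt(tpr * tnr)

# f1  = f1_score(y_test, y_pred)                              # precision/recall balance
# auc = roc_auc_score(y_test, clf.decision_function(X_test))  # AUC needs continuous scores
```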
The experiments were implemented using Python 3.6 and were carried out on a Windows 10 machine with 16 GB RAM and a 2.7 GHz processor.
Table 2
CDSMOTE outperforming SMOTE, ADASYN and CD

Dataset | IR | Measure | Baseline | SMOTE | ADASYN | CD | CDSMOTE
Yeast1 | 2.46 | AUC | 0.789 | 0.790 | 0.788 | 0.879 | 0.884
Haberman | 2.78 | AUC | 0.701 | 0.689 | 0.675 | 0.892 | 0.898
Haberman | 2.78 | Fscore | 0.061 | 0.439 | 0.446 | 0.354 | 0.474
Vehicle2 | 2.88 | Gmean | 0.962 | 0.962 | 0.959 | 0.966 | 0.969
Vehicle2 | 2.88 | Fscore | 0.940 | 0.934 | 0.935 | 0.950 | 0.955
Glass0123_vs_456 | 3.2 | AUC | 0.968 | 0.962 | 0.970 | 0.977 | 0.980
Vehicle0 | 3.25 | Fscore | 0.935 | 0.940 | 0.934 | 0.935 | 0.945
Ecoli1 | 3.36 | Fscore | 0.738 | 0.772 | 0.771 | 0.790 | 0.800
Ecoli2 | 5.46 | Gmean | 0.700 | 0.917 | 0.894 | 0.782 | 0.920
Ecoli2 | 5.46 | Fscore | 0.632 | 0.723 | 0.670 | 0.659 | 0.747
Segment0 | 6.02 | Fscore | 0.990 | 0.990 | 0.990 | 0.990 | 0.995
Glass6 | 6.38 | AUC | 0.947 | 0.923 | 0.923 | 0.979 | 0.984
Yeast3 | 8.1 | AUC | 0.967 | 0.969 | 0.967 | 0.977 | 0.979
Yeast3 | 8.1 | Gmean | 0.506 | 0.898 | 0.909 | 0.748 | 0.923
Yeast3 | 8.1 | Fscore | 0.404 | 0.669 | 0.627 | 0.686 | 0.795
Ecoli3 | 8.6 | AUC | 0.933 | 0.940 | 0.926 | 0.954 | 0.960
Ecoli3 | 8.6 | Gmean | 0.000 | 0.887 | 0.875 | 0.000 | 0.903
Ecoli3 | 8.6 | Fscore | 0.000 | 0.575 | 0.551 | 0.000 | 0.686
Page-blocks0 | 8.79 | AUC | 0.942 | 0.964 | 0.959 | 0.981 | 0.984
Ecoli034_vs_5 | 9 | AUC | 0.922 | 0.876 | 0.881 | 0.978 | 0.985
Ecoli034_vs_5 | 9 | Gmean | 0.880 | 0.884 | 0.876 | 0.812 | 0.901
Ecoli034_vs_5 | 9 | Fscore | 0.779 | 0.724 | 0.668 | 0.759 | 0.850
Yeast2_vs_4 | 9.08 | AUC | 0.883 | 0.903 | 0.912 | 0.938 | 0.968
Yeast2_vs_4 | 9.08 | Fscore | 0.454 | 0.672 | 0.624 | 0.509 | 0.740
Ecoli0234_vs_5 | 9.1 | AUC | 0.911 | 0.888 | 0.884 | 0.977 | 0.978
Ecoli0234_vs_5 | 9.1 | Gmean | 0.876 | 0.905 | 0.893 | 0.850 | 0.906
Yeast0359_vs_78 | 9.12 | AUC | 0.764 | 0.776 | 0.765 | 0.933 | 0.940
Yeast0359_vs_78 | 9.12 | Fscore | 0.331 | 0.346 | 0.301 | 0.293 | 0.393
Ecoil046_vs_5 | 9.15 | AUC | 0.876 | 0.830 | 0.804 | 0.966 | 0.980
Ecoil046_vs_5 | 9.15 | Gmean | 0.811 | 0.836 | 0.830 | 0.782 | 0.895
Ecoil046_vs_5 | 9.15 | Fscore | 0.781 | 0.734 | 0.642 | 0.679 | 0.854
Ecoli0346_vs_5 | 9.25 | Gmean | 0.844 | 0.899 | 0.871 | 0.883 | 0.934
Ecoli0346_vs_5 | 9.25 | Fscore | 0.777 | 0.757 | 0.610 | 0.790 | 0.824
Ecoli0347_vs_56 | 9.28 | AUC | 0.884 | 0.886 | 0.870 | 0.975 | 0.981
Yeast05679_vs_4 | 9.35 | AUC | 0.840 | 0.846 | 0.845 | 0.942 | 0.945
Yeast05679_vs_4 | 9.35 | Fscore | 0.000 | 0.433 | 0.424 | 0.000 | 0.455
Vowel0 | 9.98 | Gmean | 0.884 | 0.949 | 0.958 | 0.932 | 0.976
Vowel0 | 9.98 | Fscore | 0.837 | 0.789 | 0.816 | 0.899 | 0.917
Glass016_vs_2 | 10.29 | Gmean | 0.000 | 0.518 | 0.516 | 0.000 | 0.737
Glass016_vs_2 | 10.29 | Fscore | 0.000 | 0.176 | 0.195 | 0.000 | 0.222
Led7digit02456789_vs_1 | 10.97 | AUC | 0.954 | 0.949 | 0.925 | 0.980 | 0.981
Ecoli01_vs_5 | 11 | AUC | 0.936 | 0.910 | 0.910 | 0.986 | 0.987
Glass2 | 11.59 | Fscore | 0.000 | 0.162 | 0.145 | 0.000 | 0.178
Ecoli0147_vs_56 | 12.28 | AUC | 0.962 | 0.952 | 0.939 | 0.986 | 0.990
Yeast1_vs_7 | 14.3 | Gmean | 0.000 | 0.762 | 0.775 | 0.000 | 0.784
Yeast1_vs_7 | 14.3 | Fscore | 0.000 | 0.293 | 0.284 | 0.000 | 0.391
Glass4 | 15.47 | AUC | 0.993 | 0.993 | 0.990 | 0.988 | 0.997
Glass4 | 15.47 | Gmean | 0.000 | 0.912 | 0.908 | 0.141 | 0.976
Glass4 | 15.47 | Fscore | 0.000 | 0.699 | 0.662 | 0.182 | 0.823
Ecoli4 | 15.8 | Gmean | 0.283 | 0.943 | 0.937 | 0.383 | 0.961
Ecoli4 | 15.8 | Fscore | 0.302 | 0.691 | 0.691 | 0.384 | 0.813
Abalone9-18 | 16.4 | AUC | 0.920 | 0.920 | 0.928 | 0.966 | 0.976
Abalone9-18 | 16.4 | Fscore | 0.000 | 0.420 | 0.424 | 0.000 | 0.506
Glass5 | 22.78 | Fscore | 0.000 | 0.667 | 0.611 | 0.486 | 0.774
Yeast2_vs_8 | 23.1 | AUC | 0.787 | 0.824 | 0.806 | 0.977 | 0.981
Yeast2_vs_8 | 23.1 | Gmean | 0.734 | 0.744 | 0.682 | 0.734 | 0.765
Yeast2_vs_8 | 23.1 | Fscore | 0.691 | 0.626 | 0.202 | 0.691 | 0.792
Yeast4 | 28.1 | Fscore | 0.000 | 0.290 | 0.279 | 0.000 | 0.312
Winequality-red4 | 29.17 | Fscore | 0.000 | 0.126 | 0.126 | 0.000 | 0.140
Yeast1289_vs_7 | 30.57 | Fscore | 0.000 | 0.128 | 0.128 | 0.000 | 0.152
Yeast5 | 32.73 | Gmean | 0.000 | 0.967 | 0.966 | 0.000 | 0.969
Yeast5 | 32.73 | Fscore | 0.000 | 0.473 | 0.473 | 0.000 | 0.540
Yeast6 | 41.4 | Gmean | 0.000 | 0.892 | 0.874 | 0.000 | 0.897
Yeast6 | 41.4 | Fscore | 0.000 | 0.327 | 0.242 | 0.000 | 0.366
Winequality-red8_vs_67 | 46.5 | Fscore | 0.000 | 0.092 | 0.092 | 0.000 | 0.108
Abalone19 | 129.44 | Fscore | 0.000 | 0.039 | 0.058 | 0.000 | 0.075

4.3 Results

As can be seen in Table 2, CDSMOTE outperformed all other methods on one or more evaluation metrics in 39 datasets. Moreover, across the 60 datasets, CDSMOTE outperformed at least one method in one or more comparisons. The comparison against CD was made to establish the need for applying oversampling after reducing the dominance of the majority class instances.
A closer look at Table 2, comparing the performance of CDSMOTE against all other methods using Gmean, \(F_1\) Score and AUC, shows the improvement gained by applying CDSMOTE. Statistical significance of the results was measured using the paired t test at the 95% confidence level. The p values for the paired t tests between CDSMOTE and all other methods across the three evaluation metrics are shown in Table 3, and they indicate a statistically significant improvement in performance using CDSMOTE.
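The significance test reported in Table 3 can be reproduced with SciPy's paired t test; the four scores below are example AUC values taken from Table 2 (baseline versus CDSMOTE) purely to make the snippet runnable, not a re-computation of the paper's p values.

```python
from scipy.stats import ttest_rel

# Example AUC values from Table 2 (Yeast1, Haberman, Glass0123_vs_456, Glass6).
baseline_auc = [0.789, 0.701, 0.968, 0.947]
cdsmote_auc = [0.884, 0.898, 0.980, 0.984]

t_stat, p_value = ttest_rel(cdsmote_auc, baseline_auc)
print(f"p value = {p_value:.6f}")  # p < 0.05 -> significant at the 95% level
```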
Table 3
CDSMOTE versus other methods using Gmean, \(F_1\) score and AUC (t test): p values

Method | Gmean | \(F_1\)-score | AUC
1 Baseline | 0.00055801 | 0.00000107 | 0.00001961
2 SMOTE | 0.02154869 | 0.00000004 | 0.00000756
3 ADASYN | 0.00537313 | 0.00003774 | 0.00000731
4 CD | 0.00090928 | 0.00000125 | 0.00030986
It was also observed that the largest improvement across the three evaluation metrics was achieved on the \(F_1\) Score. This suggests that CDSMOTE improves the presence of the minority class instances and reduces the dominance of the majority class. The results also show that CDSMOTE did not lose on any dataset against the three other methods (ADASYN, SMOTE and CD) combined. A similar performance (tie) was, however, recorded in six datasets across the three evaluation metrics: Iris0, New-thyroid1, New-thyroid2, Shuttle0_vs_4, Shuttle2_vs_4 and Dermatology6. These are the datasets where 100% accuracy was recorded (i.e., \(F_1\) Score = 1).
For further evaluation, we compared our method with recent state-of-the-art techniques, using their most recent reported results on the same datasets. First, we consider Cleofas-Sánchez et al. [47], who addressed class-imbalance classification on 31 of the datasets using a hybrid associative classifier with translation (HACT) combined with SMOTE, evaluating the results with Gmean; Table 4 lists the performance of CDSMOTE against this method. Then, we consider Lin et al. [36], who presented a clustering-based undersampling method on 44 datasets and reported performance using AUC, with an ensemble AdaBoost C4.5 classifier used for classification; the comparison with CDSMOTE is shown in Table 5. Finally, Zhu et al. [48] used 31 of the datasets used in this paper and adopted an algorithmic-based approach by designing their own classifier, the boundary-eliminated pseudoinverse linear discriminant (BEPILD); Table 6 compares CDSMOTE against BEPILD using the two metrics reported by the authors (AUC and Gmean).
Table 4 compares CDSMOTE with [47] in terms of Gmean. Notice that CDSMOTE obtains better results in 20 out of 31 datasets. A paired t test on the 20 datasets where CDSMOTE wins shows a statistically significant difference, with a p value of 0.000506.
Table 4
CDSMOTE against SMOTE+HACT [47] using Gmean

Dataset | IR | CDSMOTE | SMOTE+HACT [47]
Wisconsin | 1.86 | 0.96 | 0.977
Yeast1 | 2.46 | 0.582 | 0.662
Haberman | 2.78 | 0.596 | 0.625
Vehicle1 | 2.9 | 0.78 | 0.63
Vehicle3 | 2.99 | 0.723 | 0.649
Glass0123_vs_456 | 3.2 | 0.91 | 0.866
Ecoli1 | 3.36 | 0.88 | 0.874
Ecoli2 | 5.46 | 0.92 | 0.892
Segment0 | 6.02 | 0.997 | 0.753
Glass6 | 6.38 | 0.94 | 0.884
Yeast3 | 8.1 | 0.923 | 0.869
Ecoli3 | 8.6 | 0.903 | 0.875
Ecoil034_vs_5 | 9 | 0.901 | 0.89
Yeast0256_vs_3789 | 9.14 | 0.766 | 0.768
Ecoli046_vs_5 | 9.15 | 0.895 | 0.878
Ecoli0346_vs_5 | 9.25 | 0.934 | 0.881
Ecoli0347_vs_56 | 9.28 | 0.823 | 0.864
Yeast05679_vs_4 | 9.35 | 0.776 | 0.81
Ecoli067_vs_5 | 10 | 0.899 | 0.835
Led7digit02456789_vs_1 | 10.97 | 0.874 | 0.888
Ecoli01_vs_5 | 11 | 0.897 | 0.894
Yeast1_vs_7 | 14.3 | 0.784 | 0.693
Glass4 | 15.47 | 0.976 | 0.867
Ecoli4 | 15.8 | 0.961 | 0.938
Yeast1458_vs_7 | 22.1 | 0.567 | 0.646
Glass5 | 22.81 | 0.788 | 0.497
Yeast2_vs_8 | 23.1 | 0.765 | 0.74
Yeast4 | 28.1 | 0.794 | 0.829
Yeast1289_vs_7 | 30.57 | 0.602 | 0.691
Ecoli0137_vs_26 | 39.14 | 0.733 | 0.816
Yeast6 | 41.4 | 0.897 | 0.873
Bold font indicates the winning method
Table 5 compares CDSMOTE and [36] using AUC, where CDSMOTE outperformed [36] in 37 out of 44 datasets. A paired t test shows a statistically significant improvement, with a p value of \(2.633 \times 10^{-7}\).
Table 5
CDSMOTE against Clust+\(C4.5_{Ab}\) [36] using AUC

Dataset | IR | CDSMOTE | Clust+\(C4.5_{Ab}\) [36]
Glass1 | 1.82 | 0.832 | 0.834
Ecoli0_vs_1 | 1.86 | 0.979 | 0.983
Wisconsin | 1.86 | 0.983 | 0.99
Pima | 1.87 | 0.905 | 0.758
Iris0 | 2 | 0.99 | 0.99
Glass0 | 2.06 | 0.898 | 0.89
Yeast1 | 2.46 | 0.884 | 0.747
Haberman | 2.78 | 0.898 | 0.641
Vehicle2 | 2.88 | 0.992 | 0.995
Vehicle1 | 2.9 | 0.933 | 0.832
Vehicle3 | 2.99 | 0.93 | 0.848
Glass0123_vs_456 | 3.2 | 0.98 | 0.982
Vehicle0 | 3.25 | 0.992 | 0.946
Ecoli1 | 3.36 | 0.974 | 0.94
New-thyroid2 | 4.92 | 0.996 | 0.956
New-thyroid1 | 5.14 | 0.999 | 0.973
Ecoli2 | 5.46 | 0.967 | 0.947
Segment0 | 6.02 | 1 | 0.996
Glass6 | 6.38 | 0.984 | 0.917
Yeast3 | 8.1 | 0.979 | 0.967
Ecoli3 | 8.6 | 0.96 | 0.926
Page-blocks0 | 8.79 | 0.984 | 0.986
Yeast2_vs_4 | 9.08 | 0.968 | 0.977
Yeast05679_vs_4 | 9.35 | 0.945 | 0.869
Vowel0 | 9.98 | 0.997 | 0.987
Glass016_vs_2 | 10.29 | 0.915 | 0.75
Glass2 | 11.59 | 0.923 | 0.76
Shuttle0_vs_4 | 13.87 | 1 | 1
Yeast1_vs_7 | 14.3 | 0.934 | 0.768
Glass4 | 15.47 | 0.997 | 0.853
Ecoli4 | 15.8 | 0.996 | 0.95
Page-blocks13_vs_2 | 15.85 | 0.995 | 0.992
Abalone9-18 | 16.4 | 0.976 | 0.831
Glass016_vs_5 | 19.44 | 0.991 | 0.964
Shuttle2_vs_4 | 20.5 | 1 | 1
Yeast1458_vs_7 | 22.1 | 0.858 | 0.632
Glass5 | 22.78 | 0.995 | 0.949
Yeast2_vs_8 | 23.1 | 0.981 | 0.868
Yeast4 | 28.1 | 0.958 | 0.874
Yeast1289_vs_7 | 30.57 | 0.936 | 0.7
Yeast5 | 32.73 | 0.984 | 0.987
Ecoli0137_vs_26 | 39.14 | 0.993 | 0.838
Yeast6 | 41.4 | 0.973 | 0.909
Abalone19 | 129.44 | 0.972 | 0.728
Bold font indicates the winning method
Table 6 shows the comparison of CDSMOTE against the BEPILD method presented in [48] for Gmean and AUC. In terms of Gmean, CDSMOTE obtains better results in 13 out of 31 datasets. The difference in performance across these datasets is not statistically significant (a t test resulted in a p value of 0.0695); however, for some application domains, such as health, life science and security, such an improvement in performance could be crucial. Considering only the 13 winning datasets in Table 6, we found a statistically significant improvement using CDSMOTE (paired t test, p value of 0.006054). When measuring performance using AUC (Table 6), our proposed method proved superior across almost all datasets: out of 31 datasets, CDSMOTE outperformed BEPILD in 30, with a t test p value of \(2.821 \times 10^{-10}\).
Table 6
CDSMOTE against BEPILD [48] using Gmean and AUC

Dataset | IR | CDSMOTE\(_{\mathrm{Gmean}}\) | BEPILD\(_{\mathrm{Gmean}}\) | CDSMOTE\(_{\mathrm{AUC}}\) | BEPILD\(_{\mathrm{AUC}}\)
Wisconsin | 1.86 | 0.96 | 0.975 | 0.983 | 0.975
Pima | 1.87 | 0.728 | 0.748 | 0.905 | 0.749
Yeast1 | 2.46 | 0.582 | 0.71 | 0.884 | 0.711
Vehicle1 | 2.9 | 0.78 | 0.787 | 0.933 | 0.788
Vehicle0 | 3.25 | 0.968 | 0.946 | 0.992 | 0.946
Glass6 | 6.38 | 0.940 | 0.919 | 0.984 | 0.92
Yeast3 | 8.1 | 0.923 | 0.899 | 0.979 | 0.899
Ecoli3 | 8.6 | 0.903 | 0.875 | 0.960 | 0.88
Ecoli034vs5 | 9 | 0.901 | 0.904 | 0.985 | 0.908
Ecoli067vs35 | 9.09 | 0.795 | 0.865 | 0.973 | 0.88
Ecoli0234vs5 | 9.1 | 0.906 | 0.908 | 0.978 | 0.917
Yeast0359vs78 | 9.12 | 0.695 | 0.744 | 0.940 | 0.755
Ecoil046vs5 | 9.15 | 0.895 | 0.892 | 0.980 | 0.906
Yeast05679vs4 | 9.35 | 0.776 | 0.784 | 0.945 | 0.789
Ecoli067vs5 | 10 | 0.899 | 0.874 | 0.981 | 0.877
Glass016vs2 | 10.29 | 0.737 | 0.734 | 0.915 | 0.75
Led7digit02456789vs1 | 10.97 | 0.874 | 0.898 | 0.981 | 0.902
Ecoli0147vs56 | 12.28 | 0.902 | 0.887 | 0.990 | 0.89
Yeast1vs7 | 14.3 | 0.784 | 0.764 | 0.934 | 0.766
Ecoli4 | 15.8 | 0.961 | 0.919 | 0.996 | 0.923
Abalone9-18 | 16.4 | 0.812 | 0.994 | 0.976 | 0.883
Dermatology6 | 16.87 | 0.999 | 0.882 | 0.999 | 0.994
Yeast1458vs7 | 22.1 | 0.567 | 0.62 | 0.858 | 0.632
Yeast4 | 28.1 | 0.794 | 0.83 | 0.958 | 0.831
Winequality-red4 | 29.17 | 0.662 | 0.696 | 0.961 | 0.7
Yeast1289vs7 | 30.57 | 0.602 | 0.716 | 0.936 | 0.725
Yeast5 | 32.73 | 0.969 | 0.962 | 0.984 | 0.987
Yeast6 | 41.4 | 0.897 | 0.875 | 0.973 | 0.878
Winequality-red8vs67 | 46.5 | 0.636 | 0.721 | 0.970 | 0.74
Winequality-white39vs5 | 58.28 | 0.586 | 0.629 | 0.982 | 0.657
Abalone19 | 129.44 | 0.723 | 0.775 | 0.972 | 0.728
Bold font indicates the winning method

4.4 Discussion

To summarize, across the 60 datasets, CDSMOTE proved superior to some of the most common and established methods for handling classification of class-imbalanced datasets, namely SMOTE [28], ADASYN [32] and CD [46]. The improvement across three common evaluation metrics (Gmean, \(F_1\) Score and AUC) was statistically significant, as shown in Table 3. These results suggest that applying class decomposition to the majority class instances in a binary dataset not only reduces the dominance of the majority class but also provides a more linearly separable space within the local class boundaries.
The proposed method also showed superior performance over recent state-of-the-art methods presented in the literature, namely Cleofas-Sánchez et al. [47], Lin et al. [36] and Zhu et al. [48], as can be seen in Tables 4, 5 and 6, and the improvement over these methods was statistically significant in most comparisons. The results also showed that the best trade-off between AUC and Gmean, the two classically used metrics for imbalanced datasets, was obtained by CDSMOTE. Overall, the proposed method obtains better AUC results than the other methods. CDSMOTE also maximizes the \(F_1\) Score, meaning that it effectively offers the best trade-off between precision and recall for the minority class. Therefore, CDSMOTE can provide an alternative for handling the class-imbalance problem in specific scenarios. It has to be pointed out that there is considerable room for improving these results. This includes hyper-parameter tuning and optimization (optimizing the k value), further experiments with different learning algorithms (i.e., ensemble-based methods), using alternative clustering methods (i.e., soft clustering techniques, density-based methods and others) or using different oversampling methods such as GANs.

5 Conclusions and future work

In this paper, we have presented a new approach for handling the class-imbalance problem by means of class decomposition. Unlike most common undersampling methods, our method suffers no data loss and preserves all majority class instances. A large-scale experiment showed that CDSMOTE produces results comparable with state-of-the-art methods, while significantly outperforming some of the most established methods across metrics such as AUC, Gmean and \(F_1\) Score. The number of datasets used in this experiment, with different sizes, dimensions and imbalance ratios, suggests that the proposed method can generalize and scale to larger and more diverse datasets. It has to be noted that these results were obtained using default parameter settings and with one classifier, namely SVM with a linear kernel, meaning that further improvement can be made at the data level as well as the algorithmic level. At the data level, the method presented in this paper can benefit from better clustering and grouping of the majority class instances; this might include isolating the instances within the overlapping region. At the algorithmic level, we intend to examine other learning algorithms, in particular ensemble-based classification methods such as RF, which has been shown to outperform other learning methods. The use of other clustering methods can also be explored for further improvement; this might include density-based clustering methods instead of k-means, which is often sensitive to noise. Finally, the results can be further improved by applying parameter tuning techniques to ensure that the best parameter settings are chosen.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Barandela R, Sánchez JS, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognit 36:849–851
2. Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal 6(5):429–449
3. Kotsiantis S, Kanellopoulos D, Pintelas P (2006) Handling imbalanced datasets: a review. GESTS Int Trans Comput Sci Eng 30(1):25–36
4. Chawla NV (2005) Data mining for imbalanced datasets: an overview. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook. Springer, Boston, MA
5. Haixiang G, Yijing L, Shang J, Mingyun G, Yuanyue H (2017) Learning from class-imbalanced data: review of methods and applications. Expert Syst Appl 73:220–239
6. Vuttipittayamongkol P, Elyan E, Petrovski A, Jayne C (2018) Overlap-based undersampling for improving imbalanced data classification. In: Yin H, Camacho D, Novais P, Tallón-Ballesteros AJ (eds) Intelligent data engineering and automated learning—IDEAL 2018. Springer, Cham, pp 689–697
7. Elyan E, Gaber MM (2016) A fine-grained random forests using class decomposition: an application to medical diagnosis. Neural Comput Appl 27(8):2279–2288
8. Zhao XM, Li X, Chen L, Aihara K (2008) Protein classification with imbalanced data. Proteins 70(2):311–319
9. García-Pedrajas N, Pérez-Rodríguez J, García-Pedrajas M, Ortiz-Boyer D, Fyfe C (2012) Class imbalance methods for translation initiation site recognition in DNA sequences. Knowl Based Syst 25(1):22–34
10. Kim MJ, Kang DK, Kim HB (2015) Geometric mean based boosting algorithm with over-sampling to resolve data imbalance problem for bankruptcy prediction. Expert Syst Appl 42(3):1074–1082
11. Vuttipittayamongkol P, Elyan E (2020) Overlap-based undersampling method for classification of imbalanced medical datasets. In: Maglogiannis I, Iliadis L, Pimenidis E (eds) Artificial intelligence applications and innovations. Springer, Cham, pp 358–369
12. Li S, Hao F, Li M, Kim H-C (2013) Medicine rating prediction and recommendation in mobile social networks. In: Park JJ, Arabnia HR, Kim C, Shi W, Gil J-M (eds) Grid and pervasive computing. Springer, Berlin, pp 216–223
13. Elyan E, Moreno-García CF, Johnston P (2020) Symbols in engineering drawings (SIED): an imbalanced dataset benchmarked by convolutional neural networks. In: Iliadis L, Angelov PP, Jayne C, Pimenidis E (eds) Proceedings of the 21st EANN (engineering applications of neural networks) 2020 conference. Springer, Cham, pp 215–224
14. Elyan E, Jamieson L, Ali-Gombe A (2020) Deep learning for symbols detection and classification in engineering drawings. Neural Netw 129:91–102
15. Elyan E, Moreno-Garcia CF, Jayne C (2018) Symbols classification in engineering drawings. In: International joint conference on neural networks (IJCNN)
16. Thai-Nghe N, Gantner Z, Schmidt-Thieme L (2010) Cost-sensitive learning methods for imbalanced data. In: Proceedings of the international joint conference on neural networks (IJCNN), pp 1–8
17. Estabrooks A, Jo T, Japkowicz N (2004) A multiple resampling method for learning from imbalanced data sets. Comput Intell 20(1):18–36
18. Krawczyk B, Galar M, Jeleń Ł, Herrera F (2016) Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy. Appl Soft Comput J 38:714–726
19. Stefanowski J (2013) Overlapping, rare examples and class decomposition in learning classifiers from imbalanced data. In: Emerging paradigms in machine learning. Springer, Berlin, pp 277–306
20. Branco P, Torgo L, Ribeiro RP (2016) A survey of predictive modeling on imbalanced domains. ACM Comput Surv 49(2):31:1–31:50
21. Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A (2010) RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Trans Syst Man Cybern Part A Syst Hum 40(1):185–197
22. Tomek I (1976) An experiment with the edited nearest-neighbor rule. IEEE Trans Syst Man Cybern SMC-6(6):448–452
23. Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one sided selection. Int Conf Mach Learn 97:179–186
24. Devi D, Biswas S, Biswajit P (2017) Redundancy-driven modified tomek-link based undersampling: a solution to class imbalance. Pattern Recognit Lett 93:3–12
25. Laurikkala J (2001) Improving identification of difficult small classes by balancing class distribution. In: Quaglini S, Barahona P, Andreassen S (eds) Artificial intelligence in medicine. Springer, Berlin, pp 63–66
26. Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C (2011) MUTE: majority under-sampling technique. In: International conference on information, communications and signal processing, pp 1–4
27. Vuttipittayamongkol P, Elyan E (2020) Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Inf Sci 509:47–70
28. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
29. Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: improving prediction of the minority class in boosting. In: Knowledge discovery in databases: KDD 2003. Springer, Berlin, pp 107–119
30. Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C (2012) DBSMOTE: density-based synthetic minority over-sampling technique. Appl Intell 36(3):664–684
31. Barua S, Islam M, Yao X, Murase K (2014) MWMOTE: majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng 26:405–425
32. Haibo H, Bai Y, Garcia EA, Li S (2008) Adaptive synthetic sampling approach for imbalanced learning. Int Jt Conf Neural Netw (IJCNN) 3:1322–1328
33. Li S, Chen W, Li S, Leung K-S (2019) Improved algorithm on online clustering of bandits. In: Proceedings of the 28th international joint conference on artificial intelligence, AAAI Press, pp 2923–2929
34. Bezdek JC, Ehrlich R, Full W (1984) FCM: the fuzzy c-means clustering algorithm. Comput Geosci 10(2–3):191–203
35. Bunkhumpornpat C, Sinapiromsaran K (2017) DBMUTE: density-based majority under-sampling technique. Knowl Inf Syst 50(3):827–850
36. Lin WC, Tsai CF, Hu YH, Jhang JS (2017) Clustering-based undersampling in class-imbalanced data. Inf Sci 409–410:17–26
37. Yong Y (2012) The research of imbalanced data set of sample sampling method based on K-means cluster and genetic algorithm. Energy Procedia 17:164–170
38. Seoane Santos M, Henriques Abreu P, García-Laencina PJ, Simão A, Carvalho A (2015) A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients. J Biomed Inform 58:49–59
39. Puntumapon K, Rakthamamon T, Waiyamai K (2016) Cluster-based minority over-sampling for imbalanced datasets. IEICE Trans Inf Syst 99(12):3101–3109
40. Lim P, Goh CK, Tan KC (2017) Evolutionary cluster-based synthetic oversampling ensemble (ECO-ensemble) for imbalance learning. IEEE Trans Cybern 47(9):2850–2861
41. Ali-Gombe A, Elyan E, Jayne C (2019) Multiple fake classes GAN for data augmentation in face image dataset. In: 2019 International joint conference on neural networks (IJCNN), pp 1–8
42. Ali-Gombe A, Elyan E (2019) MFC-GAN: class-imbalanced dataset classification using multiple fake class generative adversarial network. Neurocomputing 361:212–221
43. Geman S, Bienenstock E, Doursat R (1992) Neural networks and the bias/variance dilemma. Neural Comput 4(1):1–58
44. Vilalta R, Rish I (2003) A decomposition of classes via clustering to explain and improve naive Bayes. In: Machine learning: ECML 2003, pp 1–12
45. Polaka I (2013) Clustering algorithm specifics in class decomposition. In: Proceedings of the international scientific conference
46. Elyan E, Gaber MM (2017) A genetic algorithm approach to optimising random forests applied to class engineered data. Inf Sci 384:220–234
47. Cleofas-Sánchez L, Sánchez JS, García V, Valdovinos RM (2016) Associative learning on imbalanced environments: an empirical study. Expert Syst Appl 54:387–397
48. Zhu Y, Wang Z, Zha H, Gao D (2017) Boundary-eliminated pseudoinverse linear discriminant for imbalanced problems. IEEE Trans Neural Netw Learn Syst 29(6):2581–2594
49. Fernández-Delgado M, Cernadas E, Barro S, Amorim D (2014) Do we need hundreds of classifiers to solve real world classification problems? J Mach Learn Res 15:3133–3181
Metadata
Title: CDSMOTE: class decomposition and synthetic minority class oversampling technique for imbalanced-data classification
Authors: Eyad Elyan, Carlos Francisco Moreno-Garcia, Chrisina Jayne
Publication date: 18.07.2020
Publisher: Springer London
Published in: Neural Computing and Applications / Issue 7/2021
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI: https://doi.org/10.1007/s00521-020-05130-z