Introduction
Preliminaries
Classification with imbalanced datasets
Actual | Predicted: Positive | Predicted: Negative |
---|---|---|
Positive | True positive (TP) | False negative (FN) |
Negative | False positive (FP) | True negative (TN) |
- True-positive rate \(\mathrm{TP}_\mathrm{rate}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}\) is the percentage of positive instances correctly classified. This value is often known as sensitivity or recall.
- True-negative rate \(\mathrm{TN}_\mathrm{rate}=\frac{\mathrm{TN}}{\mathrm{FP}+\mathrm{TN}}\) is the percentage of negative instances correctly classified. This value is often known as specificity.
- False-positive rate \(\mathrm{FP}_\mathrm{rate}=\frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}\) is the percentage of negative instances misclassified.
- False-negative rate \(\mathrm{FN}_\mathrm{rate}=\frac{\mathrm{FN}}{\mathrm{TP}+\mathrm{FN}}\) is the percentage of positive instances misclassified.
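These rates can be computed directly from the confusion-matrix counts. The experimental tables below also report the geometric mean (GM); we assume here its standard definition for imbalanced classification, \(\mathrm{GM}=\sqrt{\mathrm{TP}_\mathrm{rate}\cdot \mathrm{TN}_\mathrm{rate}}\). A minimal Python sketch (the counts are illustrative):

```python
import math

def confusion_rates(tp, fn, fp, tn):
    """Compute the four rates defined above plus the geometric mean (GM)
    from raw confusion-matrix counts."""
    tp_rate = tp / (tp + fn)  # sensitivity / recall
    tn_rate = tn / (fp + tn)  # specificity
    fp_rate = fp / (fp + tn)  # negatives misclassified
    fn_rate = fn / (tp + fn)  # positives misclassified
    gm = math.sqrt(tp_rate * tn_rate)  # assumed definition: sqrt(TPR * TNR)
    return tp_rate, tn_rate, fp_rate, fn_rate, gm

# Illustrative counts only: a classifier that labels almost everything
# negative scores a high TN rate but a near-zero GM.
print(confusion_rates(tp=100, fn=11_900, fp=5_000, tn=583_000))
```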
Big Data and the MapReduce framework
Addressing imbalanced classification in Big Data problems: current state
Data pre-processing studies
Traditional data-based solutions for Big Data
Random oversampling with evolutionary feature weighting and random forest (ROSEFW-RF)
- On the one hand, the authors stressed that dealing with extremely imbalanced Big Data problems such as the one described above requires increasing the density of the underrepresented class by means of higher oversampling ratios [43].
- On the other hand, a feature selection approach was suggested to avoid the curse of dimensionality. Specifically, the authors developed a MapReduce implementation based on the evolutionary approach for feature weighting proposed in [58]. In this method, each map task performed a whole evolutionary feature weighting cycle on its data partition and emitted a vector of weights. The Reduce process was then responsible for the iterative aggregation of all the weights provided by the maps. Finally, the resulting weights were used together with a threshold to select the most relevant features; a sketch of this map/reduce decomposition is given below.
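A minimal Python sketch of that decomposition, under loudly stated assumptions: `evolve_weights` is a stand-in for the evolutionary search of [58] (it returns random weights here, purely for illustration), and the aggregation is a component-wise mean, one plausible instance of the iterative aggregation described above:

```python
import numpy as np

def evolve_weights(partition):
    # Stand-in for the evolutionary feature-weighting cycle of [58]: a real
    # implementation evolves a population of weight vectors on this partition
    # and returns the fittest one. Random weights here, for illustration only.
    n_features = partition.shape[1] - 1  # last column assumed to be the class
    return np.random.default_rng().random(n_features)

def feature_weighting_map(partition):
    # Map task: one whole evolutionary cycle on its own data partition,
    # emitting a vector with one weight per feature.
    return evolve_weights(partition)

def feature_weighting_reduce(weight_vectors):
    # Reduce task: aggregation of all the weight vectors emitted by the maps.
    return np.mean(weight_vectors, axis=0)

def select_features(weights, threshold=0.5):
    # The aggregated weights are used with a threshold to keep only
    # the most relevant features.
    return [i for i, w in enumerate(weights) if w >= threshold]

# Toy run: 4 partitions of a dataset with 90 features plus a class column.
partitions = [np.random.rand(1_000, 91) for _ in range(4)]
weights = feature_weighting_reduce([feature_weighting_map(p) for p in partitions])
print(select_features(weights))
```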
Evolutionary undersampling
Data cleaning
NRSBoundary-SMOTE
Extreme learning machine with resampling
Multi-class imbalance
Summary
Cost-sensitive learning studies
Cost-sensitive SVM
Instance weighting SVM
Cost-sensitive random forest
Cost-sensitive fuzzy rule-based classification system (FRBCS)
Summary
Applications on imbalanced Big Data
Pairwise ortholog detection
Traffic accidents prediction
Biomedical data
Summary
Practical study on imbalanced Big Data classification using MapReduce
Datasets | #Ex. | #Atts. | Class (maj; min) | #Class (maj; min) | %Class (maj; min) | IR |
---|---|---|---|---|---|---|
ECBDL14-subset-12mill-90features | 12,000,000 | 90 | (0; 1) | (11,760,000; 240,000) | (98; 2) | 49 |
ECBDL14-subset-0.6mill-90features | 600,000 | 90 | (0; 1) | (588,000; 12,000) | (98; 2) | 49 |
Analysis of pre-processing techniques in MapReduce
Algorithm | Parameters |
---|---|
Decision trees (DT) and random forest (RF-S) | Number of trees: 1 (DT); 100 (RF-S) |
 | Number of bins used when discretizing continuous features: 100 |
 | Impurity measure: gini |
 | Randomly selected attributes: 9 |
 | Maximum depth of each tree: 5 |
Random forest Hadoop (RF-H) | Number of trees: 100 |
 | Randomly selected attributes: 7 |
 | Maximum depth of each tree: unlimited |
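Assuming that RF-S denotes the RDD-based random forest of Spark MLlib, the DT/RF-S parameters above map onto its training call roughly as follows. This is a sketch, not the authors' actual script: the input path is hypothetical, and `featureSubsetStrategy="sqrt"` is chosen because \(\sqrt{90}\approx 9\) matches the 9 randomly selected attributes in the table:

```python
from pyspark import SparkContext
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName="rf-s-sketch")
# Hypothetical path; substitute the actual ECBDL14 training file (LibSVM format).
data = MLUtils.loadLibSVMFile(sc, "ecbdl14_train.libsvm")

model = RandomForest.trainClassifier(
    data,
    numClasses=2,
    categoricalFeaturesInfo={},    # all 90 attributes treated as continuous
    numTrees=100,                  # 1 for the single DT, 100 for RF-S
    featureSubsetStrategy="sqrt",  # sqrt(90) ~ 9 randomly selected attributes
    impurity="gini",
    maxDepth=5,
    maxBins=100,                   # bins when discretizing continuous features
)
```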
Algorithm | 1 partition | | 8 partitions | | 16 partitions | | 32 partitions | | 64 partitions | |
---|---|---|---|---|---|---|---|---|---|---|
 | GM\(_\mathrm{tr}\) | GM\(_\mathrm{tst}\) | GM\(_\mathrm{tr}\) | GM\(_\mathrm{tst}\) | GM\(_\mathrm{tr}\) | GM\(_\mathrm{tst}\) | GM\(_\mathrm{tr}\) | GM\(_\mathrm{tst}\) | GM\(_\mathrm{tr}\) | GM\(_\mathrm{tst}\) |
ECBDL14-subset-0.6mill-90features | | | | | | | | | | |
RF-H | | | | | | | | | | |
Without pre-processing | 0.56912 | \(\underline{0.00000}\) | 0.00000 | \(\underline{0.00000}\) | 0.00000 | \(\underline{0.00000}\) | 0.00000 | \(\underline{0.00000}\) | 0.00000 | \(\underline{0.00000}\) |
ROS | 1.00000 | 0.02240 | 0.98590 | 0.49250 | 0.94960 | 0.64410 | 0.89980 | \({\underline{\mathbf{0.67770}}}\) | 0.85110 | 0.66520 |
SMOTE | 0.37530 | \(\underline{0.12380}\) | 0.13270 | 0.08060 | 0.12760 | 0.07070 | 0.12640 | 0.09210 | 0.13780 | 0.07740 |
RUS | 0.85480 | \({\underline{\mathbf{0.66450}}}\) | 0.74610 | 0.65550 | 0.72660 | 0.65970 | 0.71910 | 0.65540 | 0.70300 | 0.64610 |
RF-S | | | | | | | | | | |
Without pre-processing | 0.00000 | \(\underline{0.00000}\) | 0.00000 | \(\underline{0.00000}\) | 0.00000 | \(\underline{0.00000}\) | 0.00000 | \(\underline{0.00000}\) | 0.00000 | \(\underline{0.00000}\) |
ROS | 0.71220 | 0.64909 | 0.71241 | 0.65035 | 0.70892 | 0.64709 | 0.71142 | 0.64666 | 0.71121 | \({\underline{\mathbf{0.65042}}}\) |
SMOTE | 0.75876 | 0.60567 | 0.76859 | 0.62237 | 0.76594 | 0.62036 | 0.77538 | 0.62326 | 0.78057 | \(\underline{0.62496}\) |
RUS | 0.71437 | \({\underline{\mathbf{0.65197}}}\) | 0.71779 | 0.64921 | 0.71289 | 0.64782 | 0.71458 | 0.63898 | 0.71683 | 0.64767 |
DT | | | | | | | | | | |
Without pre-processing | 0.00000 | \(\underline{0.00000}\) | 0.00000 | \(\underline{0.00000}\) | 0.00000 | \(\underline{0.00000}\) | 0.00000 | \(\underline{0.00000}\) | 0.00000 | \(\underline{0.00000}\) |
ROS | 0.70195 | \({\underline{\mathbf{0.62987}}}\) | 0.70408 | 0.62827 | 0.70195 | \(\underline{0.62987}\) | 0.70504 | 0.62873 | 0.70211 | \({\underline{\mathbf{0.62987}}}\) |
SMOTE | 0.71366 | \(\underline{0.55672}\) | 0.72406 | 0.50999 | 0.73219 | 0.45379 | 0.73166 | 0.44642 | 0.73323 | 0.47313 |
RUS | 0.70828 | 0.62528 | 0.70542 | 0.62987 | 0.70583 | \({\underline{\mathbf{0.63204}}}\) | 0.70413 | 0.62584 | 0.70390 | 0.61696 |
ECBDL14-subset-12mill-90features | | | | | | | | | | |
RF-H | | | | | | | | | | |
Without pre-processing | * | * | 0.02579 | \(\underline{0.00500}\) | 0.01000 | 0.00000 | 0.00447 | 0.00000 | 0.00000 | 0.00000 |
ROS | * | * | 0.98350 | 0.50720 | 0.95120 | 0.63760 | 0.90890 | 0.69310 | 0.86160 | \({\underline{\mathbf{0.70560}}}\) |
SMOTE | * | * | N.D. | N.D. | N.D. | N.D. | 0.06625 | \(\underline{0.05005}\) | 0.07765 | 0.03535 |
RUS | * | * | 0.75970 | \({\underline{\mathbf{0.69920}}}\) | 0.74340 | 0.69510 | 0.73370 | 0.69190 | 0.72550 | 0.68880 |
RF-S | | | | | | | | | | |
Without pre-processing | * | * | 0.00000 | \(\underline{0.00000}\) | 0.00000 | \(\underline{0.00000}\) | 0.00000 | \(\underline{0.00000}\) | 0.00000 | \(\underline{0.00000}\) |
ROS | * | * | N.D. | N.D. | 0.70902 | 0.66998 | 0.70673 | \({\underline{\mathbf{0.67091}}}\) | 0.70599 | 0.66695 |
SMOTE | * | * | N.D. | N.D. | N.D. | N.D. | 0.75375 | \(\underline{0.63239}\) | 0.75816 | 0.63184 |
RUS | * | * | 0.70911 | 0.66983 | 0.70887 | \({\underline{\mathbf{0.67055}}}\) | 0.70827 | 0.66700 | 0.70780 | 0.66802 |
DT | | | | | | | | | | |
Without pre-processing | * | * | 0.00000 | \(\underline{0.00000}\) | 0.00000 | \(\underline{0.00000}\) | 0.00000 | \(\underline{0.00000}\) | 0.00000 | \(\underline{0.00000}\) |
ROS | * | * | N.D. | N.D. | 0.70433 | 0.66517 | 0.70422 | 0.66472 | 0.70396 | \(\underline{0.66551}\) |
SMOTE | * | * | N.D. | N.D. | N.D. | N.D. | 0.71970 | \(\underline{0.46296}\) | 0.71037 | 0.44341 |
RUS | * | * | 0.70540 | 0.66625 | 0.70508 | \({\underline{\mathbf{0.66652}}}\) | 0.69734 | 0.66406 | 0.70618 | 0.66615 |
- Classification models are more accurate when learned from the full dataset than from the 5% sample. This supports the use of Big Data technologies, which allow the whole representation of the problem to be exploited. However, two issues may hinder classification ability in this setting: on the one hand, the amount of noise present in this kind of data may also be higher; on the other hand, the curse of dimensionality may add further complexity to the modeling.
- The results obtained with SMOTE combined with RF-S are better than those obtained with SMOTE combined with RF-H or DT. Therefore, the particular implementation of the RF-S method has a significant influence on this performance.
- ROS and RUS show higher-quality results than SMOTE-BigData (the MapReduce implementation from [41]) in all experiments. This points to an added complexity in adapting the SMOTE algorithm to Big Data, which we discuss further below. This behavior is also depicted in Fig. 2, which shows how the classification performance of SMOTE degrades up to a certain limit as the number of partitions increases.
- Regarding the previous point, we have further analyzed the source of these differences in performance between SMOTE and the random sampling techniques. Specifically, we used the full ECBDL14 dataset to show in Table 6 the true rates for the positive and negative classes obtained by the decision tree on the test partitions. We observe a good trade-off between both metrics in the case of ROS and RUS, whereas SMOTE pre-processing shows a bias towards the negative class.
- The inner working procedure of both ROS and RUS, which is based on random sampling of the minority or majority class, makes them scalable approaches; see the sketch after this list.
- Finally, we must state that the degree of performance achieved depends mostly on the behavior of the learning algorithm. However, when contrasting ROS and RUS we observe better results for the former. This is because ROS is largely independent of the number of Maps, as it just makes exact copies of the instances, which are shuffled among the chunks of data. For RUS, on the contrary, the data distribution changes when some of the instances are removed. Additionally, increasing the number of partitions has a more severe effect for RUS due to the lack of data.
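As a rough illustration of the scalability argument in the last two points, both ROS and RUS reduce to single-pass, embarrassingly parallel transformations. A sketch with Spark RDD primitives, assuming an RDD of LabeledPoint as in the earlier sketch (the labels and sampling fractions are illustrative; the actual implementations of [41] may differ):

```python
def random_oversampling(rdd, minority_label=1.0, ratio=49.0):
    # ROS: replicate the minority class by sampling with replacement, so that
    # exact copies of instances end up shuffled among the chunks of data.
    minority = rdd.filter(lambda ex: ex.label == minority_label)
    majority = rdd.filter(lambda ex: ex.label != minority_label)
    # With replacement, a fraction > 1 yields on average `ratio` copies
    # of each minority instance (Poisson sampling).
    return majority.union(minority.sample(True, ratio))

def random_undersampling(rdd, minority_label=1.0, ratio=49.0):
    # RUS: randomly drop majority-class instances so that both classes end up
    # with roughly the same size; the data distribution changes as a result.
    minority = rdd.filter(lambda ex: ex.label == minority_label)
    majority = rdd.filter(lambda ex: ex.label != minority_label)
    return minority.union(majority.sample(False, 1.0 / ratio))
```

Neither function requires any computation across partitions, whereas SMOTE must find nearest neighbours for every minority instance; in a map-local implementation, that search only sees the ever smaller minority sample of each partition as their number grows.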
Algorithm | 1 partition | | 8 partitions | | 16 partitions | | 32 partitions | | 64 partitions | |
---|---|---|---|---|---|---|---|---|---|---|
 | TPR\(_\mathrm{tst}\) | TNR\(_\mathrm{tst}\) | TPR\(_\mathrm{tst}\) | TNR\(_\mathrm{tst}\) | TPR\(_\mathrm{tst}\) | TNR\(_\mathrm{tst}\) | TPR\(_\mathrm{tst}\) | TNR\(_\mathrm{tst}\) | TPR\(_\mathrm{tst}\) | TNR\(_\mathrm{tst}\) |
ECBDL14-subset-12mill-90features | | | | | | | | | | |
DT | | | | | | | | | | |
Without pre-processing | * | * | 0.00000 | 1.00000 | 0.00000 | 1.00000 | 0.00000 | 1.00000 | 0.00000 | 1.00000 |
ROS | * | * | N.D. | N.D. | 0.63830 | 0.69317 | 0.63980 | 0.69062 | 0.65800 | 0.67312 |
SMOTE | * | * | N.D. | N.D. | N.D. | N.D. | 0.26040 | 0.82308 | 0.23345 | 0.84221 |
RUS | * | * | 0.65790 | 0.67470 | 0.64845 | 0.68508 | 0.66395 | 0.66417 | 0.64960 | 0.68312 |