Imbalanced data sets classification: an intro
Related work
Organization of work
Architecture
Experimental workflow
- Using steps 2 and 3, the newly updated data helps to improve the Random Forest model, and can consequently be examined for cluster cohesiveness.
- Repeat step 4 for the real-time streaming input data set.
Over_sampling techniques
UCPMOT: a rationalized technique
Other basic techniques
Non-cluster based over_sampling techniques
Technique-1
Safe-level based synthetic samples creation (SSS)
Technique-2
Technique-3
Clustering based over_sampling techniques
Technique-4
Experimental context
Details of the data set under experimentation
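Category | Data set | #EX | #IR | #ATTR | #CL |
---|---|---|---|---|---|
Multi-class semi-structured/un-structured data sets | PAMAP2 | 3,850,505 | 14.35 | 54 | 19 |
Multi-class semi-structured/un-structured data sets | Landstat | 6435 | 2.44 | 37 | 7 |
Multi-class semi-structured/un-structured data sets | Mashup | 9135 | 623 | 8 | 67 |
Multi-class semi-structured/un-structured data sets | SIDO | 12,678 | 27.04 | 4932 | 2 |
Multi-class structured data sets | Yeast | 1484 | 92.6 | 9 | 10 |
Multi-class structured data sets | Car | 1728 | 18.61 | 6 | 4 |
Multi-class structured data sets | KEGG-U | 65,554 | 5959.45 | 29 | 43 |
Binary-class structured data sets | MiniBoone | 130,065 | 2.56 | 51 | 2 |
Binary-class structured data sets | Credit card | 284,808 | 577.87 | 31 | 2 |
Binary-class structured data sets | RLCP | 5,749,132 | 273.67 | 12 | 2 |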
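
The #IR column is not defined in this section. As a minimal sketch, assuming it follows the common convention of dividing the size of the largest class by the size of the smallest class, an imbalance ratio could be computed as below. The function name `imbalance_ratio` and the toy labels are illustrative assumptions, not taken from the experiments.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Largest class count divided by smallest class count.

    Assumption: this is one common definition of the imbalance ratio;
    the source does not state which definition it uses.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes")
    return max(counts.values()) / min(counts.values())


# Toy example: 90 majority samples against 10 minority samples.
toy_labels = ["neg"] * 90 + ["pos"] * 10
ratio = imbalance_ratio(toy_labels)
```

For multi-class data this definition only compares the two extreme classes, so two data sets with the same ratio can still have quite different class distributions.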
Input pre-processing
- Tokenization,
- Stop word removal,
- Stemming.
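
The three pre-processing steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical one, and the stemmer is a naive suffix stripper standing in for whatever stemmer the experiments actually used.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}

# Naive suffix list standing in for a real stemmer (an assumption; not Porter).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenization, then stop-word removal, then stemming."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]


tokens = preprocess("The sensors are recording the walking activities")
```

A crude stemmer like this one over-strips some words ("activities" becomes "activiti"), which is acceptable for a sketch but is one reason a production pipeline would normally rely on an established stemming algorithm.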
- For each map (k1,v1).
- Clustering with respect to the updated Cc (k-means).
- Output (ci,ni).
- Reduce (Cc,ci,ni).
- Merge (mci,mni).
- Check for new Cc, k1, v1.
- Output (Cc,k1,v1,mci,mni).
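
One possible reading of the map and reduce steps above is sketched below in plain Python, without a MapReduce framework. The interpretation is an assumption: Cc is taken to be the current cluster centers, (ci, ni) the local centroid and sample count a mapper emits per cluster, and (mci, mni) the merged centroid and count produced by the reducer. All function names are hypothetical.

```python
def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))


def map_partition(points, centers):
    """Mapper: per-cluster local centroid c_i and count n_i for one partition."""
    sums, counts = {}, {}
    for point in points:
        j = nearest(point, centers)
        acc = sums.setdefault(j, [0.0] * len(point))
        for d, value in enumerate(point):
            acc[d] += value
        counts[j] = counts.get(j, 0) + 1
    return {j: ([s / counts[j] for s in sums[j]], counts[j]) for j in sums}


def reduce_merge(centers, mapper_outputs):
    """Reducer: count-weighted merge of local centroids into (mc_i, mn_i)."""
    merged = []
    for j, old in enumerate(centers):
        total = sum(out[j][1] for out in mapper_outputs if j in out)
        if total == 0:
            merged.append((list(old), 0))  # empty cluster keeps its center
            continue
        centroid = [
            sum(out[j][0][d] * out[j][1] for out in mapper_outputs if j in out) / total
            for d in range(len(old))
        ]
        merged.append((centroid, total))
    return merged


def kmeans_mapreduce(partitions, centers, max_iter=20, tol=1e-9):
    """Repeat map and reduce until the centers C_c stop moving."""
    merged = [(list(c), 0) for c in centers]
    for _ in range(max_iter):
        outputs = [map_partition(part, centers) for part in partitions]
        merged = reduce_merge(centers, outputs)
        new_centers = [c for c, _ in merged]
        shift = max(abs(a - b) for old, new in zip(centers, new_centers)
                    for a, b in zip(old, new))
        centers = new_centers
        if shift <= tol:
            break
    return centers, merged


# Two partitions stand in for the inputs of two mappers.
partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
final_centers, merged = kmeans_mapreduce(partitions, [[0.0, 0.0], [10.0, 10.0]])
```

Emitting a centroid together with its count lets the reducer reconstruct the exact global mean as a weighted average, so the mappers never have to ship individual samples.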
Assumptions and environmental pre-settings
Performance evaluation parameters
- Pi: precision of the ith class = True Positive/(True Positive + False Positive)
- Ri: recall (sensitivity) of the ith class = True Positive/(True Positive + False Negative)
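
As a minimal sketch of the two definitions above, per-class precision and recall can be computed from paired true and predicted labels by counting true positives, false positives and false negatives for each class. The function name and the toy labels are illustrative assumptions; a class with no predictions or no true samples is given a score of 0.0 here by convention.

```python
from collections import Counter


def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} using the TP/FP/FN definitions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores


y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]
scores = per_class_precision_recall(y_true, y_pred)
```

In this toy example class "a" has perfect precision but a recall of 2/3, because one of its three samples is predicted as "b".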
Notations
Notation | Algorithms |
---|---|
A | Original data set result |
B | SMOTE |
C | Borderline-SMOTE |
D | ADASYN |
E | SPIDER2 |
F | SMOTEBoost |
G | MWMOTE |
H | UCPMOT_MEMMOT |
I | UCPMOT_MMMmOT |
J | UCPMOT_CMEOT |
K | UCPMOT_NF_N + MOT |

Over_sampling techniques (columns A to K):

Classifier | Data set | A | B | C | D | E | F | G | H | I | J | K |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Random Forest | PAMAP2 | 0.43 | 0.63 | 0.67 | 0.65 | 0.70 | 0.64 | 0.74 | 0.76 | 0.77 | 0.75 | 0.67 |
Random Forest | Landstat | 0.84 | 0.87 | 0.87 | 0.86 | 0.90 | 0.85 | 0.91 | 0.95 | 0.96 | 0.93 | 0.90 |
Random Forest | Mashup | 0.26 | 0.33 | 0.36 | 0.35 | 0.50 | 0.34 | 0.53 | 0.54 | 0.58 | 0.55 | 0.40 |
Random Forest | SIDO | 0.86 | 0.89 | 0.90 | 0.89 | 0.92 | 0.89 | 0.93 | 0.94 | 0.95 | 0.93 | 0.91 |
Naïve Bayes | PAMAP2 | 0.39 | 0.58 | 0.61 | 0.59 | 0.64 | 0.59 | 0.70 | 0.71 | 0.74 | 0.74 | 0.62 |
Naïve Bayes | Landstat | 0.81 | 0.84 | 0.85 | 0.85 | 0.88 | 0.84 | 0.91 | 0.91 | 0.92 | 0.91 | 0.86 |
Naïve Bayes | Mashup | 0.24 | 0.30 | 0.33 | 0.32 | 0.46 | 0.31 | 0.50 | 0.51 | 0.54 | 0.53 | 0.36 |
Naïve Bayes | SIDO | 0.83 | 0.86 | 0.87 | 0.85 | 0.89 | 0.85 | 0.91 | 0.91 | 0.92 | 0.92 | 0.89 |
AdaBoostM1 | PAMAP2 | 0.40 | 0.60 | 0.64 | 0.63 | 0.69 | 0.62 | 0.73 | 0.73 | 0.75 | 0.74 | 0.65 |
AdaBoostM1 | Landstat | 0.82 | 0.86 | 0.86 | 0.86 | 0.91 | 0.86 | 0.92 | 0.93 | 0.93 | 0.92 | 0.89 |
AdaBoostM1 | Mashup | 0.25 | 0.32 | 0.35 | 0.34 | 0.47 | 0.33 | 0.51 | 0.52 | 0.56 | 0.54 | 0.38 |
AdaBoostM1 | SIDO | 0.85 | 0.88 | 0.89 | 0.88 | 0.90 | 0.93 | 0.92 | 0.93 | 0.94 | 0.93 | 0.90 |
MultiLayer Perceptron | PAMAP2 | 0.40 | 0.59 | 0.63 | 0.63 | 0.68 | 0.62 | 0.73 | 0.73 | 0.74 | 0.74 | 0.64 |
MultiLayer Perceptron | Landstat | 0.81 | 0.85 | 0.86 | 0.85 | 0.90 | 0.85 | 0.91 | 0.92 | 0.92 | 0.92 | 0.88 |
MultiLayer Perceptron | Mashup | 0.25 | 0.31 | 0.34 | 0.33 | 0.49 | 0.32 | 0.52 | 0.53 | 0.55 | 0.53 | 0.37 |
MultiLayer Perceptron | SIDO | 0.84 | 0.87 | 0.88 | 0.87 | 0.90 | 0.87 | 0.90 | 0.92 | 0.93 | 0.92 | 0.90 |
Overall average | | 0.58 | 0.66 | 0.68 | 0.67 | 0.74 | 0.67 | 0.77 | 0.77 | 0.79 | 0.78 | 0.70 |

Over_sampling techniques (columns A to K):

Classifier | Data set | A | B | C | D | E | F | G | H | I | J | K |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Random Forest | PAMAP2 | 0.44 | 0.62 | 0.67 | 0.65 | 0.72 | 0.64 | 0.73 | 0.74 | 0.77 | 0.75 | 0.68 |
Random Forest | Landstat | 0.84 | 0.88 | 0.91 | 0.90 | 0.92 | 0.89 | 0.93 | 0.94 | 0.96 | 0.94 | 0.91 |
Random Forest | Mashup | 0.28 | 0.34 | 0.38 | 0.36 | 0.51 | 0.35 | 0.54 | 0.54 | 0.58 | 0.56 | 0.41 |
Random Forest | SIDO | 0.87 | 0.91 | 0.92 | 0.92 | 0.93 | 0.91 | 0.94 | 0.95 | 0.97 | 0.95 | 0.92 |
Naïve Bayes | PAMAP2 | 0.40 | 0.60 | 0.63 | 0.62 | 0.68 | 0.61 | 0.71 | 0.72 | 0.75 | 0.73 | 0.64 |
Naïve Bayes | Landstat | 0.82 | 0.85 | 0.87 | 0.86 | 0.91 | 0.85 | 0.92 | 0.92 | 0.93 | 0.92 | 0.89 |
Naïve Bayes | Mashup | 0.25 | 0.31 | 0.34 | 0.33 | 0.48 | 0.32 | 0.52 | 0.52 | 0.55 | 0.53 | 0.37 |
Naïve Bayes | SIDO | 0.84 | 0.87 | 0.88 | 0.88 | 0.91 | 0.87 | 0.92 | 0.92 | 0.93 | 0.93 | 0.90 |
AdaBoostM1 | PAMAP2 | 0.42 | 0.61 | 0.65 | 0.64 | 0.71 | 0.63 | 0.74 | 0.75 | 0.76 | 0.75 | 0.66 |
AdaBoostM1 | Landstat | 0.83 | 0.87 | 0.89 | 0.88 | 0.92 | 0.87 | 0.93 | 0.93 | 0.94 | 0.93 | 0.90 |
AdaBoostM1 | Mashup | 0.26 | 0.32 | 0.36 | 0.35 | 0.50 | 0.34 | 0.52 | 0.53 | 0.57 | 0.55 | 0.39 |
AdaBoostM1 | SIDO | 0.86 | 0.90 | 0.90 | 0.89 | 0.92 | 0.89 | 0.94 | 0.94 | 0.95 | 0.94 | 0.91 |
MultiLayer Perceptron | PAMAP2 | 0.41 | 0.60 | 0.64 | 0.63 | 0.70 | 0.62 | 0.73 | 0.74 | 0.75 | 0.74 | 0.65 |
MultiLayer Perceptron | Landstat | 0.83 | 0.86 | 0.88 | 0.87 | 0.92 | 0.86 | 0.92 | 0.92 | 0.94 | 0.92 | 0.90 |
MultiLayer Perceptron | Mashup | 0.26 | 0.31 | 0.35 | 0.34 | 0.49 | 0.33 | 0.53 | 0.53 | 0.56 | 0.54 | 0.38 |
MultiLayer Perceptron | SIDO | 0.85 | 0.89 | 0.89 | 0.88 | 0.91 | 0.88 | 0.92 | 0.93 | 0.94 | 0.94 | 0.90 |
Overall average | | 0.59 | 0.67 | 0.70 | 0.69 | 0.76 | 0.68 | 0.78 | 0.78 | 0.80 | 0.79 | 0.71 |

Over_sampling techniques (columns A to K):

Classifier | Data set | A | B | C | D | E | F | G | H | I | J | K |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Random Forest | Yeast | 0.67 | 0.74 | 0.76 | 0.75 | 0.82 | 0.74 | 0.83 | 0.84 | 0.85 | 0.85 | 0.77 |
Random Forest | Car | 0.90 | 0.93 | 0.94 | 0.94 | 0.94 | 0.93 | 0.95 | 0.96 | 0.96 | 0.96 | 0.93 |
Random Forest | KEGG-U | 0.87 | 0.90 | 0.91 | 0.90 | 0.92 | 0.89 | 0.93 | 0.93 | 0.95 | 0.94 | 0.92 |
Naïve Bayes | Yeast | 0.62 | 0.68 | 0.71 | 0.70 | 0.79 | 0.69 | 0.80 | 0.79 | 0.81 | 0.80 | 0.76 |
Naïve Bayes | Car | 0.86 | 0.89 | 0.90 | 0.90 | 0.91 | 0.89 | 0.92 | 0.92 | 0.94 | 0.93 | 0.90 |
Naïve Bayes | KEGG-U | 0.82 | 0.85 | 0.86 | 0.86 | 0.88 | 0.85 | 0.89 | 0.89 | 0.90 | 0.88 | 0.87 |
AdaBoostM1 | Yeast | 0.65 | 0.72 | 0.75 | 0.74 | 0.81 | 0.73 | 0.82 | 0.82 | 0.83 | 0.82 | 0.76 |
AdaBoostM1 | Car | 0.89 | 0.92 | 0.93 | 0.93 | 0.93 | 0.92 | 0.94 | 0.95 | 0.95 | 0.94 | 0.93 |
AdaBoostM1 | KEGG-U | 0.85 | 0.88 | 0.89 | 0.89 | 0.91 | 0.88 | 0.91 | 0.91 | 0.92 | 0.90 | 0.90 |
MultiLayer Perceptron | Yeast | 0.64 | 0.70 | 0.73 | 0.72 | 0.80 | 0.71 | 0.81 | 0.81 | 0.82 | 0.81 | 0.78 |
MultiLayer Perceptron | Car | 0.88 | 0.90 | 0.92 | 0.91 | 0.92 | 0.90 | 0.94 | 0.94 | 0.96 | 0.94 | 0.92 |
MultiLayer Perceptron | KEGG-U | 0.84 | 0.87 | 0.88 | 0.88 | 0.90 | 0.87 | 0.91 | 0.90 | 0.91 | 0.89 | 0.88 |
Overall average | | 0.79 | 0.83 | 0.84 | 0.84 | 0.87 | 0.83 | 0.88 | 0.88 | 0.90 | 0.88 | 0.86 |

Over_sampling techniques (columns A to K):

Classifier | Data set | A | B | C | D | E | F | G | H | I | J | K |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Random Forest | Yeast | 0.67 | 0.75 | 0.77 | 0.76 | 0.84 | 0.75 | 0.85 | 0.86 | 0.87 | 0.86 | 0.79 |
Random Forest | Car | 0.90 | 0.93 | 0.95 | 0.94 | 0.96 | 0.93 | 0.96 | 0.97 | 0.97 | 0.96 | 0.95 |
Random Forest | KEGG-U | 0.87 | 0.92 | 0.93 | 0.93 | 0.94 | 0.92 | 0.94 | 0.94 | 0.96 | 0.94 | 0.93 |
Naïve Bayes | Yeast | 0.63 | 0.71 | 0.73 | 0.72 | 0.80 | 0.71 | 0.81 | 0.82 | 0.83 | 0.81 | 0.75 |
Naïve Bayes | Car | 0.87 | 0.90 | 0.91 | 0.90 | 0.93 | 0.90 | 0.93 | 0.93 | 0.94 | 0.93 | 0.92 |
Naïve Bayes | KEGG-U | 0.83 | 0.87 | 0.89 | 0.89 | 0.89 | 0.88 | 0.90 | 0.90 | 0.91 | 0.89 | 0.88 |
AdaBoostM1 | Yeast | 0.65 | 0.73 | 0.75 | 0.74 | 0.82 | 0.74 | 0.83 | 0.84 | 0.85 | 0.83 | 0.77 |
AdaBoostM1 | Car | 0.89 | 0.92 | 0.93 | 0.93 | 0.95 | 0.92 | 0.95 | 0.96 | 0.96 | 0.95 | 0.94 |
AdaBoostM1 | KEGG-U | 0.85 | 0.90 | 0.91 | 0.91 | 0.92 | 0.90 | 0.92 | 0.92 | 0.94 | 0.92 | 0.91 |
MultiLayer Perceptron | Yeast | 0.64 | 0.72 | 0.74 | 0.73 | 0.81 | 0.72 | 0.82 | 0.83 | 0.84 | 0.82 | 0.76 |
MultiLayer Perceptron | Car | 0.88 | 0.91 | 0.92 | 0.91 | 0.94 | 0.91 | 0.94 | 0.94 | 0.95 | 0.94 | 0.93 |
MultiLayer Perceptron | KEGG-U | 0.84 | 0.88 | 0.90 | 0.90 | 0.90 | 0.89 | 0.91 | 0.91 | 0.92 | 0.90 | 0.89 |
Overall average | | 0.79 | 0.84 | 0.86 | 0.85 | 0.89 | 0.84 | 0.89 | 0.90 | 0.91 | 0.89 | 0.86 |

Over_sampling techniques (columns A to K):

Data set | A | B | C | D | E | F | G | H | I | J | K |
---|---|---|---|---|---|---|---|---|---|---|---|
Yeast | 1484 | 3292 | 3092 | 3476 | 3345 | 3189 | 3069 | 2725 | 2463 | 2991 | 2888 |
Car | 1728 | 4326 | 4289 | 4716 | 4489 | 4167 | 4174 | 3525 | 3154 | 4089 | 3711 |
KEGG-U | 65,554 | 189,651 | 177,651 | 194,511 | 192,489 | 171,551 | 175,167 | 123,544 | 117,629 | 156,233 | 140,013 |

Over_sampling techniques (columns A to K):

Data set | A | B | C | D | E | F | G | H | I | J | K |
---|---|---|---|---|---|---|---|---|---|---|---|
Yeast | 1484 | 8410 | 7921 | 8643 | 8550 | 8223 | 7703 | 4614 | 3106 | 7615 | 7023 |
Car | 1728 | 4867 | 4699 | 5003 | 4980 | 4421 | 4609 | 3589 | 3472 | 4581 | 3947 |
KEGG-U | 65,554 | 199,326 | 184,559 | 201,004 | 200,665 | 195,438 | 186,598 | 128,451 | 123,486 | 164,878 | 149,532 |

Over_sampling techniques (columns B to K, each reported as %RDD and %RDA):

Data set | B %RDD | B %RDA | C %RDD | C %RDA | D %RDD | D %RDA | E %RDD | E %RDA | F %RDD | F %RDA | G %RDD | G %RDA | H %RDD | H %RDA | I %RDD | I %RDA | J %RDD | J %RDA | K %RDD | K %RDA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Yeast | 84.4 | 1.34 | 87.69 | 1.3 | 85.27 | 1.32 | 87.51 | 2.41 | 88.22 | 1.34 | 86.03 | 2.38 | 51.47 | 2.35 | 23.09 | 2.32 | 87.19 | 1.16 | 83.44 | 2.56 |
Car | 11.76 | 2.12 | 9.12 | 1.05 | 5.91 | 0 | 10.37 | 2.11 | 5.91 | 0 | 9.91 | 1.04 | 1.79 | 1.03 | 9.59 | 1.03 | 11.34 | 0 | 6.16 | 2.12 |
KEGG-U | 4.97 | 2.19 | 3.81 | 2.17 | 3.28 | 3.27 | 4.15 | 2.15 | 13.01 | 3.31 | 6.31 | 1.07 | 3.89 | 1.06 | 4.85 | 1.04 | 5.38 | 1.05 | 6.57 | 1.08 |
Average | 34.73 | 1.88 | 33.54 | 1.13 | 31.48 | 1.53 | 34.01 | 2.22 | 35.71 | 1.55 | 34.08 | 1.49 | 19.05 | 1.11 | 12.51 | 1.1 | 34.64 | 0.55 | 32.6 | 1.44 |

Over_sampling techniques (columns A to K):

Classifier | Data set | A | B | C | D | E | F | G | H | I | J | K |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Random Forest | MiniBoone | 0.82 | 0.84 | 0.84 | 0.84 | 0.87 | 0.84 | 0.90 | 0.89 | 0.91 | 0.90 | 0.85 |
Random Forest | Credit card | 0.72 | 0.75 | 0.78 | 0.77 | 0.85 | 0.76 | 0.87 | 0.87 | 0.89 | 0.88 | 0.79 |
Random Forest | RLCP | 0.24 | 0.27 | 0.28 | 0.28 | 0.69 | 0.27 | 0.73 | 0.73 | 0.74 | 0.72 | 0.42 |
Naïve Bayes | MiniBoone | 0.79 | 0.83 | 0.83 | 0.83 | 0.85 | 0.83 | 0.86 | 0.86 | 0.89 | 0.87 | 0.83 |
Naïve Bayes | Credit card | 0.69 | 0.74 | 0.75 | 0.75 | 0.82 | 0.74 | 0.84 | 0.84 | 0.85 | 0.84 | 0.76 |
Naïve Bayes | RLCP | 0.23 | 0.25 | 0.26 | 0.26 | 0.66 | 0.25 | 0.68 | 0.70 | 0.70 | 0.70 | 0.40 |
AdaBoostM1 | MiniBoone | 0.80 | 0.83 | 0.84 | 0.84 | 0.86 | 0.83 | 0.88 | 0.87 | 0.90 | 0.88 | 0.84 |
AdaBoostM1 | Credit card | 0.70 | 0.74 | 0.76 | 0.75 | 0.82 | 0.75 | 0.85 | 0.86 | 0.86 | 0.86 | 0.78 |
AdaBoostM1 | RLCP | 0.23 | 0.26 | 0.27 | 0.27 | 0.68 | 0.26 | 0.71 | 0.71 | 0.72 | 0.70 | 0.41 |
MultiLayer Perceptron | MiniBoone | 0.79 | 0.83 | 0.83 | 0.83 | 0.85 | 0.83 | 0.87 | 0.87 | 0.89 | 0.87 | 0.83 |
MultiLayer Perceptron | Credit card | 0.69 | 0.74 | 0.76 | 0.75 | 0.82 | 0.75 | 0.85 | 0.85 | 0.86 | 0.85 | 0.77 |
MultiLayer Perceptron | RLCP | 0.23 | 0.25 | 0.26 | 0.26 | 0.67 | 0.25 | 0.70 | 0.71 | 0.71 | 0.70 | 0.40 |
Overall average | | 0.58 | 0.61 | 0.62 | 0.62 | 0.79 | 0.61 | 0.81 | 0.81 | 0.83 | 0.81 | 0.67 |

Over_sampling techniques (columns A to K):

Classifier | Data set | A | B | C | D | E | F | G | H | I | J | K |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Random Forest | MiniBoone | 0.86 | 0.90 | 0.92 | 0.92 | 0.93 | 0.91 | 0.94 | 0.94 | 0.96 | 0.95 | 0.92 |
Random Forest | Credit card | 0.78 | 0.83 | 0.86 | 0.85 | 0.88 | 0.84 | 0.91 | 0.89 | 0.92 | 0.91 | 0.85 |
Random Forest | RLCP | 0.40 | 0.48 | 0.50 | 0.49 | 0.77 | 0.48 | 0.79 | 0.79 | 0.81 | 0.80 | 0.51 |
Naïve Bayes | MiniBoone | 0.85 | 0.87 | 0.90 | 0.91 | 0.91 | 0.90 | 0.92 | 0.92 | 0.93 | 0.92 | 0.90 |
Naïve Bayes | Credit card | 0.76 | 0.80 | 0.82 | 0.82 | 0.87 | 0.81 | 0.89 | 0.87 | 0.90 | 0.90 | 0.83 |
Naïve Bayes | RLCP | 0.37 | 0.46 | 0.47 | 0.47 | 0.71 | 0.46 | 0.76 | 0.76 | 0.78 | 0.77 | 0.48 |
AdaBoostM1 | MiniBoone | 0.85 | 0.89 | 0.91 | 0.90 | 0.92 | 0.89 | 0.93 | 0.93 | 0.94 | 0.94 | 0.91 |
AdaBoostM1 | Credit card | 0.77 | 0.81 | 0.84 | 0.83 | 0.89 | 0.82 | 0.90 | 0.88 | 0.91 | 0.90 | 0.85 |
AdaBoostM1 | RLCP | 0.39 | 0.47 | 0.48 | 0.48 | 0.73 | 0.47 | 0.77 | 0.78 | 0.80 | 0.80 | 0.50 |
MultiLayer Perceptron | MiniBoone | 0.85 | 0.88 | 0.91 | 0.90 | 0.91 | 0.89 | 0.92 | 0.92 | 0.93 | 0.93 | 0.91 |
MultiLayer Perceptron | Credit card | 0.76 | 0.81 | 0.83 | 0.83 | 0.86 | 0.82 | 0.88 | 0.88 | 0.90 | 0.90 | 0.84 |
MultiLayer Perceptron | RLCP | 0.38 | 0.47 | 0.47 | 0.47 | 0.73 | 0.47 | 0.77 | 0.77 | 0.79 | 0.79 | 0.49 |
Overall average | | 0.66 | 0.72 | 0.74 | 0.73 | 0.84 | 0.73 | 0.86 | 0.86 | 0.88 | 0.87 | 0.74 |

Number of mappers (columns 8 to 64):

Data set | 8 | 16 | 32 | 64 |
---|---|---|---|---|
KEGG-U | 0.85 | 0.84 | 0.82 | 0.81 |
PAMAP2 | 0.76 | 0.74 | 0.73 | 0.72 |
Mashup | 0.54 | 0.53 | 0.52 | 0.50 |

Over_sampling techniques (columns B to K):

Number of mappers | B | C | D | E | F | G | H | I | J | K |
---|---|---|---|---|---|---|---|---|---|---|
8 | 2:21:37 | 3:09:44 | 3:04:53 | 4:11:26 | 4:02:42 | 3:21:08 | 3:11:42 | 3:19:07 | 3:22:21 | 3:23:03 |
16 | 2:10:32 | 2:16:39 | 2:14:38 | 3:10:49 | 3:07:55 | 2:23:07 | 2:18:09 | 2:22:56 | 3:01:48 | 3:03:38 |
32 | 1:23:20 | 2:04:21 | 2:02:16 | 2:20:11 | 2:17:23 | 2:06:17 | 2:03:32 | 2:04:51 | 2:12:34 | 2:15:02 |
64 | 1:17:56 | 1:22:47 | 1:21:38 | 2:12:53 | 2:09:43 | 2:02:34 | 1:23:21 | 2:01:14 | 2:05:41 | 2:07:29 |
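
To compare how the run times above scale with the number of mappers, the entries can be converted to seconds and divided. This is a rough sketch that assumes the values are elapsed times written as h:mm:ss (the unit is not stated in this section); the two columns copied here are techniques B and I, and the helper names are hypothetical.

```python
def to_seconds(stamp):
    """Convert an 'h:mm:ss' string to seconds (assumed time format)."""
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def speedup(times, base_mappers, target_mappers):
    """Ratio of the run time at base_mappers to the run time at target_mappers."""
    return to_seconds(times[base_mappers]) / to_seconds(times[target_mappers])


# Values copied from the table rows for 8, 16, 32 and 64 mappers.
technique_b = {8: "2:21:37", 16: "2:10:32", 32: "1:23:20", 64: "1:17:56"}
technique_i = {8: "3:19:07", 16: "2:22:56", 32: "2:04:51", 64: "2:01:14"}

speedup_b = speedup(technique_b, 8, 64)
speedup_i = speedup(technique_i, 8, 64)
```

Under that assumption, both ratios come out below 2 even though the number of mappers grows eightfold, which suggests that the reported run times do not scale linearly with the mapper count.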