Introduction
Related work
Background
PCA
ICA
Autoencoder
PCA vs Autoencoder
Tomek links
Methodology
Development of Models
Metrics
- \(\textit{True Positive} \hbox { (TP)}\) is the number of positive samples correctly identified as positive.
- \(\textit{True Negative} \hbox { (TN)}\) is the number of negative samples correctly identified as negative.
- \(\textit{False Positive} \hbox { (FP)}\), also known as Type I error, is the number of negative instances incorrectly identified as positive.
- \(\textit{False Negative} \hbox { (FN)}\), also known as Type II error, is the number of positive instances incorrectly identified as negative.
Actual class | Predicted positive | Predicted negative |
---|---|---|
Positive | True Positive (TP) | False Negative (FN) (Type II error) |
Negative | False Positive (FP) (Type I error) | True Negative (TN) |
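The four cells of the confusion matrix can be counted directly from paired actual and predicted labels. The following is a minimal sketch, assuming binary labels encoded as 1 (positive) and 0 (negative); the function name `confusion_counts` and the toy labels are illustrative and do not come from the study.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for binary labels, assuming 1 = positive and 0 = negative."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # positive correctly identified as positive
        elif actual == 0 and predicted == 0:
            tn += 1  # negative correctly identified as negative
        elif actual == 0 and predicted == 1:
            fp += 1  # Type I error
        else:
            fn += 1  # Type II error
    return tp, tn, fp, fn


# Toy example (hypothetical labels): 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 4, 1, 1)
```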
- Recall, also known as \(\textit{True Positive Rate} \hbox { (TPR)}\) or sensitivity, is equal to \(\hbox {TP}/(\hbox {TP} + \hbox {FN})\).
- Specificity, also known as \(\textit{True Negative Rate} \hbox { (TNR)}\), is equal to \(\hbox {TN}/(\hbox {TN} + \hbox {FP})\).
- Fall-out, also known as \(\textit{False Positive Rate} \hbox { (FPR)}\), is equal to \(\hbox {FP}/(\hbox {TN} + \hbox {FP})\).
- Miss Rate, also known as \(\textit{False Negative Rate} \hbox { (FNR)}\), is equal to \(\hbox {FN}/(\hbox {TP} + \hbox {FN})\).
- \(\hbox {AUC}\) is the area under the receiver operating characteristic (ROC) curve, which plots recall versus (1 - specificity), or \(\hbox {TPR}\) vs \(\hbox {FPR}\), across all classifier decision thresholds [53]. The \(\hbox {AUC}\) obtained from this curve is a single value that ranges from 0 to 1, with a perfect classifier having a value of 1.

In our study, we use more than one performance metric (recall, \(\hbox {FNR}\), \(\hbox {AUC}\)). This strategy allows us to better understand the challenge of evaluating machine learning algorithms on highly imbalanced data.
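The rate definitions above translate directly into code. The sketch below is illustrative only: the names `rates` and `roc_auc` and the toy inputs are assumptions, not the study's implementation. The AUC is computed with the rank (Mann-Whitney) formulation, which is equivalent to the area under the TPR-vs-FPR curve: the fraction of positive/negative pairs in which the positive instance receives the higher score, with ties counted as one half.

```python
from typing import Dict, Sequence


def rates(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute the confusion-matrix rates defined in the text."""
    return {
        "recall": tp / (tp + fn),       # TPR, sensitivity
        "specificity": tn / (tn + fp),  # TNR
        "fpr": fp / (tn + fp),          # fall-out
        "fnr": fn / (tp + fn),          # miss rate
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical counts: recall and FNR share a denominator, so they sum to 1.
r = rates(tp=2, tn=4, fp=1, fn=1)
print(round(r["recall"] + r["fnr"], 12))  # 1.0

# Hypothetical scores: 3 of the 4 positive/negative pairs are ranked correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
```

Because recall and \(\hbox {FNR}\) are complementary, reporting both is redundant in principle, but it makes the cost of missed positives explicit when the positive class is rare.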
Results and discussion
Algorithm | Noisy instances | AUC | FNR | Recall |
---|---|---|---|---|
Autoencoder | 452 | 0.96 | 0.10 | 0.90 |
Autoencoder | 352 | 0.95 | 0.12 | 0.88 |
Autoencoder | 201 | 0.95 | 0.14 | 0.86 |
Autoencoder | 51 | 0.94 | 0.18 | 0.82 |
ICA | 452 | 0.94 | 0.14 | 0.84 |
ICA | 352 | 0.93 | 0.15 | 0.82 |
ICA | 201 | 0.92 | 0.17 | 0.83 |
ICA | 51 | 0.91 | 0.18 | 0.81 |
PCA | 452 | 0.94 | 0.13 | 0.83 |
PCA | 352 | 0.93 | 0.15 | 0.81 |
PCA | 201 | 0.93 | 0.17 | 0.81 |
PCA | 51 | 0.90 | 0.18 | 0.80 |
Tomek links | 452 | 0.90 | 0.50 | 0.54 |
Tomek links | 352 | 0.89 | 0.50 | 0.53 |
Tomek links | 201 | 0.90 | 0.53 | 0.55 |
Tomek links | 51 | 0.88 | 0.60 | 0.62 |
Factor | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
---|---|---|---|---|---|
Algorithms | 3 | 8.840 | 2.9466 | 7066.09 | <2e-16 |
Noise level | 3 | 0.081 | 0.0270 | 64.77 | <2e-16 |
Interaction | 9 | 0.252 | 0.0280 | 67.16 | <2e-16 |
Residuals | 624 | 0.260 | 0.0004 |
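The columns of the ANOVA table above are linked by the standard identities Mean Sq = Sum Sq / Df and F = Mean Sq / (residual Mean Sq). The sketch below recomputes both from the printed Df and Sum Sq values; the function name `anova_f_values` and the dictionary layout are illustrative assumptions. Because the printed sums of squares are rounded, the recomputed F values agree with the reported ones only approximately.

```python
from typing import Dict, Tuple

# (Df, Sum Sq) copied from the table above.
rows: Dict[str, Tuple[int, float]] = {
    "Algorithms": (3, 8.840),
    "Noise level": (3, 0.081),
    "Interaction": (9, 0.252),
    "Residuals": (624, 0.260),
}


def anova_f_values(table: Dict[str, Tuple[int, float]],
                   residual_key: str = "Residuals") -> Dict[str, Tuple[float, float]]:
    """Return {factor: (mean_sq, f_value)} for every non-residual row."""
    df_res, ss_res = table[residual_key]
    ms_res = ss_res / df_res  # residual mean square
    result = {}
    for name, (df, ss) in table.items():
        if name == residual_key:
            continue
        ms = ss / df                       # Mean Sq = Sum Sq / Df
        result[name] = (ms, ms / ms_res)   # F = Mean Sq / residual Mean Sq
    return result


for factor, (mean_sq, f_value) in anova_f_values(rows).items():
    print(f"{factor}: Mean Sq = {mean_sq:.4f}, F = {f_value:.1f}")
```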
Factor | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
---|---|---|---|---|---|
Algorithms | 3 | 17.833 | 5.944 | 13230.69 | <2e-16 |
Noise level | 3 | 0.522 | 0.174 | 387.55 | <2e-16 |
Interaction | 9 | 0.114 | 0.013 | 28.14 | <2e-16 |
Residuals | 624 | 0.280 | 0.000 |
Factor | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
---|---|---|---|---|---|
Algorithms | 3 | 1.0855 | 0.3618 | 1366.760 | <2e-16 |
Noise level | 3 | 0.1539 | 0.0513 | 193.787 | <2e-16 |
Interaction | 9 | 0.0041 | 0.0005 | 1.733 | 0.0783 |
Residuals | 624 | 0.1652 | 0.0003 |
Metric | Autoencoder | ICA | PCA | Tomek links |
---|---|---|---|---|
Recall | a | b | c | d |
FNR | a | b | b | c |
AUC | a | b | c | d |