Introduction
- As the AUPRC score increases, the threshold-based performance scores also improve.
- As random undersampling (RUS) is used to increase the positive class prior probability, the optimal thresholds also increase.
- The best overall results for the selection of an optimal threshold are obtained without the use of RUS.
- For most metrics, the default threshold yields its best results at a balanced (1:1) class ratio.
- However, the combination of the default threshold and a balanced class ratio yields the lowest AUPRC scores for all classifiers, implying a significant tradeoff for balancing the classes.
- The default threshold does not yield good results when the dataset is imbalanced.
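
The points above contrast the default threshold of 0.5 with a threshold chosen to maximize a performance metric. The following is a minimal sketch of that idea, not the procedure used in this study: it scans the distinct predicted scores and keeps the threshold with the highest F-measure. The function names (`confusion`, `f_measure`, `best_threshold`) and the toy data are hypothetical and chosen only for illustration.

```python
def confusion(y_true, scores, threshold):
    """Count TP, FP, FN, TN when scores >= threshold are labeled positive."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f_measure(tp, fp, fn, tn):
    """F-measure (harmonic mean of precision and recall); 0.0 if undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores, metric=f_measure):
    """Return (threshold, value) maximizing `metric` over the distinct scores.

    The default threshold 0.5 is the starting candidate, so the result is
    never worse than the default on the data it was selected on.
    """
    best_t = 0.5
    best_v = metric(*confusion(y_true, scores, best_t))
    for t in sorted(set(scores)):
        v = metric(*confusion(y_true, scores, t))
        if v > best_v:
            best_t, best_v = t, v
    return best_t, best_v


# Hypothetical imbalanced toy data: both positives score below 0.5.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.01, 0.02, 0.02, 0.03, 0.05, 0.04, 0.01, 0.30, 0.20, 0.45]

default_f = f_measure(*confusion(y_true, scores, 0.5))
opt_t, opt_f = best_threshold(y_true, scores)
print(f"default threshold 0.5: F-measure = {default_f:.3f}")
print(f"selected threshold {opt_t}: F-measure = {opt_f:.3f}")
```

On this toy data the default threshold labels nothing as positive, while the selected threshold sits well below 0.5. In practice the threshold would be selected on training or validation data rather than on the data used for the final evaluation.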
Related work
Data description
Methodology
Classifier | Maximum tree depth |
---|---|
XGBoost | max_depth=1 for all class ratios |
CatBoost | max_depth=1 for 1:1, 1:3, 1:9, max_depth=5 for 1:27, 1:81 |
Random forest | max_depth=4 for all class ratios |
Extremely randomized trees | max_depth=8 for all class ratios |
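
The class ratios in the table above (1:1 through 1:81) are positive-to-negative ratios obtained with RUS. As a rough sketch of how such a ratio could be produced, the snippet below keeps every positive instance and draws a random subset of negatives. The helper `random_undersample`, its parameters, and the toy label vector are assumptions made for illustration and are not taken from the study.

```python
import random


def random_undersample(labels, neg_per_pos, seed=0):
    """Return sorted indices that keep every positive (label 1) and a random
    subset of negatives (label 0), giving roughly a 1:neg_per_pos ratio."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Never request more negatives than the data contains.
    n_keep = min(len(neg), neg_per_pos * len(pos))
    rng = random.Random(seed)  # local generator, so the sample is reproducible
    kept_neg = rng.sample(neg, n_keep)
    return sorted(pos + kept_neg)


# Hypothetical label vector: 10 positives and 1000 negatives.
labels = [1] * 10 + [0] * 1000

for ratio in (1, 3, 9, 27, 81):
    idx = random_undersample(labels, ratio)
    n_pos = sum(labels[i] for i in idx)
    prior = n_pos / len(idx)
    print(f"1:{ratio} -> {len(idx)} rows, positive class prior = {prior:.4f}")
```

The positive class prior after sampling at a 1:k ratio is 1/(1+k), so it rises as more negatives are discarded. Only the training data would normally be undersampled; the evaluation data keeps its original class distribution.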
Results and discussion
Classification results for the original class ratio (no RUS applied)
Classifier | AUC | AUPRC |
---|---|---|
CatBoost depth 5 | 0.9834 | 0.8592 |
Extremely randomized trees depth 8 | 0.9721 | 0.8092 |
Logistic regression | 0.9737 | 0.7586 |
Random forest depth 4 | 0.9601 | 0.8067 |
XGBoost depth 1 | 0.9775 | 0.8261 |
Technique | Threshold | TPR | FPR | FNR | TNR | F-meas | G-mean | MCC | Precision |
---|---|---|---|---|---|---|---|---|---|
F-meas | 0.0018 | 0.9039 | 0.0150 | 0.0961 | 0.9850 | 0.1764 | 0.9434 | 0.2935 | 0.0982 |
F-meas NC | 0.3541 | 0.8053 | 0.0001 | 0.1947 | 0.9999 | 0.8687 | 0.8970 | 0.8717 | 0.9449 |
G-mean | 0.0015 | 0.9063 | 0.0180 | 0.0937 | 0.9820 | 0.1514 | 0.9432 | 0.2698 | 0.0829 |
G-mean NC | 0.0023 | 0.8970 | 0.0108 | 0.1030 | 0.9892 | 0.2395 | 0.9418 | 0.3476 | 0.1405 |
MCC | 0.0015 | 0.9063 | 0.0180 | 0.0937 | 0.9820 | 0.1510 | 0.9432 | 0.2695 | 0.0826 |
MCC NC | 0.0023 | 0.8970 | 0.0108 | 0.1030 | 0.9892 | 0.2395 | 0.9418 | 0.3476 | 0.1405 |
Precision | 0.0018 | 0.9039 | 0.0150 | 0.0961 | 0.9850 | 0.1764 | 0.9434 | 0.2935 | 0.0982 |
Precision NC | 0.4869 | 0.7953 | 0.0001 | 0.2047 | 0.9999 | 0.8676 | 0.8915 | 0.8714 | 0.9560 |
C | 0.0017 | 0.9037 | 0.0147 | 0.0963 | 0.9853 | 0.1755 | 0.9435 | 0.2934 | 0.0973 |
D | 0.5000 | 0.7937 | 0.0001 | 0.2063 | 0.9999 | 0.8665 | 0.8905 | 0.8704 | 0.9559 |
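
For reference, the metric columns in the threshold tables can all be derived from a confusion matrix. The sketch below uses the standard definitions of these metrics. The function name `threshold_metrics` and the example counts are hypothetical, and the way the study aggregates values across runs is not shown here, so recomputing one column from the others in a table row may not reproduce the reported value exactly.

```python
import math


def threshold_metrics(tp, fp, fn, tn):
    """Compute the table's metric columns from confusion-matrix counts."""
    tpr = tp / (tp + fn)            # true positive rate (recall)
    fpr = fp / (fp + tn)            # false positive rate
    fnr = 1.0 - tpr                 # false negative rate
    tnr = 1.0 - fpr                 # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_denom = 2 * tp + fp + fn
    f_meas = 2 * tp / f_denom if f_denom else 0.0
    g_mean = math.sqrt(tpr * tnr)   # geometric mean of TPR and TNR
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return {
        "TPR": tpr, "FPR": fpr, "FNR": fnr, "TNR": tnr,
        "F-meas": f_meas, "G-mean": g_mean, "MCC": mcc,
        "Precision": precision,
    }


# Hypothetical counts: 100 positives and 900 negatives.
m = threshold_metrics(tp=80, fp=20, fn=20, tn=880)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Because precision depends on the number of false positives relative to true positives, a small FPR can still produce low precision when negatives vastly outnumber positives.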
Classification results for the 1:1 class ratio
Classifier | AUC | AUPRC |
---|---|---|
Random forest depth 4 | 0.9771 | 0.7347 |
Extremely randomized trees depth 8 | 0.9803 | 0.7379 |
Logistic regression | 0.9770 | 0.6030 |
XGBoost depth 1 | 0.9790 | 0.7121 |
CatBoost depth 1 | 0.9785 | 0.5928 |
Technique | Threshold | TPR | FPR | FNR | TNR | F-meas | G-mean | MCC | Precision |
---|---|---|---|---|---|---|---|---|---|
F-meas | 0.3100 | 0.9281 | 0.0690 | 0.0719 | 0.9310 | 0.0458 | 0.9293 | 0.1412 | 0.0235 |
F-meas NC | 0.3760 | 0.9207 | 0.0552 | 0.0793 | 0.9448 | 0.0609 | 0.9325 | 0.1623 | 0.0317 |
G-mean | 0.2992 | 0.9301 | 0.0717 | 0.0699 | 0.9283 | 0.0440 | 0.9291 | 0.1383 | 0.0225 |
G-mean NC | 0.3783 | 0.9201 | 0.0544 | 0.0799 | 0.9456 | 0.0612 | 0.9326 | 0.1629 | 0.0319 |
MCC | 0.2921 | 0.9313 | 0.0741 | 0.0687 | 0.9259 | 0.0427 | 0.9285 | 0.1361 | 0.0219 |
MCC NC | 0.3916 | 0.9193 | 0.0524 | 0.0807 | 0.9476 | 0.0641 | 0.9331 | 0.1669 | 0.0334 |
Precision | 0.3311 | 0.9250 | 0.0627 | 0.0750 | 0.9373 | 0.0494 | 0.9310 | 0.1473 | 0.0254 |
Precision NC | 0.8357 | 0.8594 | 0.0095 | 0.1406 | 0.9905 | 0.3031 | 0.9219 | 0.3922 | 0.2045 |
C | 0.5000 | 0.9099 | 0.0339 | 0.0901 | 0.9661 | 0.0873 | 0.9375 | 0.1993 | 0.0459 |
D | 0.5000 | 0.9099 | 0.0339 | 0.0901 | 0.9661 | 0.0873 | 0.9375 | 0.1993 | 0.0459 |
Classification results for the 1:3 class ratio
Classifier | AUC | AUPRC |
---|---|---|
Random forest depth 4 | 0.9742 | 0.7313 |
Extremely randomized trees depth 8 | 0.9786 | 0.7387 |
Logistic regression | 0.9790 | 0.7090 |
XGBoost depth 1 | 0.9786 | 0.7481 |
CatBoost depth 1 | 0.9790 | 0.7374 |
Technique | Threshold | TPR | FPR | FNR | TNR | F-meas | G-mean | MCC | Precision |
---|---|---|---|---|---|---|---|---|---|
F-meas | 0.1507 | 0.9217 | 0.0518 | 0.0783 | 0.9482 | 0.0584 | 0.9347 | 0.1614 | 0.0302 |
F-meas NC | 0.3901 | 0.8907 | 0.0166 | 0.1093 | 0.9834 | 0.1730 | 0.9356 | 0.2861 | 0.0975 |
G-mean | 0.1324 | 0.9260 | 0.0606 | 0.0740 | 0.9394 | 0.0508 | 0.9325 | 0.1497 | 0.0261 |
G-mean NC | 0.2208 | 0.9094 | 0.0347 | 0.0906 | 0.9653 | 0.0890 | 0.9367 | 0.2003 | 0.0470 |
MCC | 0.1317 | 0.9260 | 0.0610 | 0.0740 | 0.9390 | 0.0505 | 0.9323 | 0.1492 | 0.0260 |
MCC NC | 0.2350 | 0.9073 | 0.0326 | 0.0927 | 0.9674 | 0.0960 | 0.9367 | 0.2081 | 0.0510 |
Precision | 0.1509 | 0.9217 | 0.0517 | 0.0783 | 0.9483 | 0.0585 | 0.9348 | 0.1616 | 0.0302 |
Precision NC | 0.8467 | 0.7722 | 0.0025 | 0.2278 | 0.9975 | 0.5380 | 0.8716 | 0.5810 | 0.4962 |
C | 0.2500 | 0.9057 | 0.0276 | 0.0943 | 0.9724 | 0.1029 | 0.9383 | 0.2181 | 0.0546 |
D | 0.5000 | 0.8836 | 0.0103 | 0.1164 | 0.9897 | 0.2305 | 0.9349 | 0.3393 | 0.1332 |
Classification results for the 1:9 class ratio
Classifier | AUC | AUPRC |
---|---|---|
Random forest depth 4 | 0.9719 | 0.7418 |
Extremely randomized trees depth 8 | 0.9779 | 0.7523 |
Logistic regression | 0.9786 | 0.7298 |
XGBoost depth 1 | 0.9801 | 0.7804 |
CatBoost depth 1 | 0.9789 | 0.7680 |
Technique | Threshold | TPR | FPR | FNR | TNR | F-meas | G-mean | MCC | Precision |
---|---|---|---|---|---|---|---|---|---|
F-meas | 0.0535 | 0.9183 | 0.0529 | 0.0817 | 0.9471 | 0.0571 | 0.9324 | 0.1591 | 0.0295 |
F-meas NC | 0.4056 | 0.8667 | 0.0042 | 0.1333 | 0.9958 | 0.4201 | 0.9288 | 0.4886 | 0.2820 |
G-mean | 0.0477 | 0.9217 | 0.0600 | 0.0783 | 0.9400 | 0.0510 | 0.9307 | 0.1497 | 0.0262 |
G-mean NC | 0.1089 | 0.8982 | 0.0252 | 0.1018 | 0.9748 | 0.1206 | 0.9355 | 0.2349 | 0.0651 |
MCC | 0.0475 | 0.9217 | 0.0603 | 0.0783 | 0.9397 | 0.0507 | 0.9306 | 0.1493 | 0.0261 |
MCC NC | 0.1250 | 0.8947 | 0.0215 | 0.1053 | 0.9785 | 0.1408 | 0.9355 | 0.2552 | 0.0772 |
Precision | 0.0535 | 0.9183 | 0.0529 | 0.0817 | 0.9471 | 0.0571 | 0.9324 | 0.1591 | 0.0295 |
Precision NC | 0.8815 | 0.6851 | 0.0005 | 0.3149 | 0.9995 | 0.6875 | 0.8208 | 0.7039 | 0.7591 |
C | 0.1000 | 0.8992 | 0.0249 | 0.1008 | 0.9751 | 0.1115 | 0.9362 | 0.2273 | 0.0595 |
D | 0.5000 | 0.8599 | 0.0027 | 0.1401 | 0.9973 | 0.5053 | 0.9259 | 0.5539 | 0.3597 |
Classification results for the 1:27 class ratio
Classifier | AUC | AUPRC |
---|---|---|
CatBoost depth 5 | 0.9817 | 0.7963 |
Random forest depth 4 | 0.9699 | 0.7483 |
Extremely randomized trees depth 8 | 0.9754 | 0.7693 |
Logistic regression | 0.9785 | 0.7504 |
XGBoost depth 1 | 0.9802 | 0.7885 |
Technique | Threshold | TPR | FPR | FNR | TNR | F-meas | G-mean | MCC | Precision |
---|---|---|---|---|---|---|---|---|---|
F-meas | 0.0681 | 0.8819 | 0.0053 | 0.1181 | 0.9947 | 0.3856 | 0.9365 | 0.4641 | 0.2533 |
F-meas NC | 0.1711 | 0.8685 | 0.0018 | 0.1315 | 0.9982 | 0.5979 | 0.9310 | 0.6289 | 0.4584 |
G-mean | 0.0333 | 0.8977 | 0.0132 | 0.1023 | 0.9868 | 0.2238 | 0.9410 | 0.3316 | 0.1323 |
G-mean NC | 0.0734 | 0.8801 | 0.0048 | 0.1199 | 0.9952 | 0.4054 | 0.9358 | 0.4795 | 0.2699 |
MCC | 0.0333 | 0.8977 | 0.0132 | 0.1023 | 0.9868 | 0.2238 | 0.9410 | 0.3316 | 0.1323 |
MCC NC | 0.0734 | 0.8801 | 0.0048 | 0.1199 | 0.9952 | 0.4054 | 0.9358 | 0.4795 | 0.2699 |
Precision | 0.0681 | 0.8819 | 0.0053 | 0.1181 | 0.9947 | 0.3856 | 0.9365 | 0.4641 | 0.2533 |
Precision NC | 0.4621 | 0.8480 | 0.0009 | 0.1520 | 0.9991 | 0.7224 | 0.9202 | 0.7336 | 0.6446 |
C | 0.0357 | 0.8923 | 0.0091 | 0.1077 | 0.9909 | 0.2517 | 0.9402 | 0.3591 | 0.1469 |
D | 0.5000 | 0.8526 | 0.0007 | 0.1474 | 0.9993 | 0.7560 | 0.9229 | 0.7610 | 0.6812 |
Classification results for the 1:81 class ratio
Classifier | AUC | AUPRC |
---|---|---|
CatBoost depth 5 | 0.9832 | 0.8490 |
Random forest depth 4 | 0.9657 | 0.7852 |
Extremely randomized trees depth 8 | 0.9747 | 0.7915 |
Logistic regression | 0.9782 | 0.7580 |
XGBoost depth 1 | 0.9801 | 0.8057 |
Technique | Threshold | TPR | FPR | FNR | TNR | F-meas | G-mean | MCC | Precision |
---|---|---|---|---|---|---|---|---|---|
F-meas | 0.0158 | 0.8923 | 0.0081 | 0.1077 | 0.9919 | 0.2831 | 0.9406 | 0.3844 | 0.1700 |
F-meas NC | 0.1760 | 0.8553 | 0.0008 | 0.1447 | 0.9992 | 0.7489 | 0.9243 | 0.7555 | 0.6711 |
G-mean | 0.0120 | 0.8988 | 0.0122 | 0.1012 | 0.9878 | 0.2133 | 0.9421 | 0.3256 | 0.1223 |
G-mean NC | 0.0201 | 0.8872 | 0.0061 | 0.1128 | 0.9939 | 0.3458 | 0.9389 | 0.4338 | 0.2182 |
MCC | 0.0119 | 0.8990 | 0.0123 | 0.1010 | 0.9877 | 0.2123 | 0.9422 | 0.3248 | 0.1216 |
MCC NC | 0.0201 | 0.8872 | 0.0061 | 0.1128 | 0.9939 | 0.3458 | 0.9389 | 0.4338 | 0.2182 |
Precision | 0.0158 | 0.8923 | 0.0081 | 0.1077 | 0.9919 | 0.2831 | 0.9406 | 0.3844 | 0.1700 |
Precision NC | 0.4323 | 0.8347 | 0.0004 | 0.1653 | 0.9996 | 0.8107 | 0.9133 | 0.8115 | 0.7912 |
C | 0.0122 | 0.8977 | 0.0110 | 0.1023 | 0.9890 | 0.2200 | 0.9421 | 0.3328 | 0.1256 |
D | 0.5000 | 0.8333 | 0.0003 | 0.1667 | 0.9997 | 0.8268 | 0.9126 | 0.8269 | 0.8218 |