1 Introduction
2 Background
2.1 The class imbalance problem
2.2 Strategies to mitigate class imbalance
- Data-level strategies, also known as external strategies, focus on modifying the dataset to rebalance the class distribution. Techniques such as oversampling, undersampling, and hybrid methods are employed. Oversampling methods generate synthetic examples or replicate existing instances of the minority class to augment its representation. Conversely, undersampling methods reduce the number of examples from the majority class to achieve a more balanced dataset. Hybrid methods combine oversampling and undersampling techniques to achieve the desired class distribution. These modifications are typically performed as a preprocessing step to ensure improved model performance (Fernández et al. 2018).
- Algorithm-level strategies, also called internal strategies, involve adapting the learning algorithms to assign greater importance to the minority class. These strategies require a deeper understanding of the model and the application domain to identify why the model fails under imbalanced class distributions (Fernández et al. 2018).
- Cost-sensitive strategies consider the varying costs associated with misclassifications across different classes. They lie between data-level and algorithm-level strategies: they can operate at the data level by assigning costs to individual instances, or at the algorithm level by incorporating cost considerations into the learning process (López et al. 2013; Fernández et al. 2018).
- Ensemble-based strategies combine multiple base learners to create a more accurate and robust classification model. These strategies can be adapted to handle imbalanced datasets in two ways. Firstly, the ensemble learning algorithm can be modified at the data level, enabling preprocessing steps to be performed on the data before the learning stage of each classifier (López et al. 2013; Fernández et al. 2018). Alternatively, a cost-sensitive framework can be incorporated to build cost-sensitive ensembles. Rather than altering the base classifier to accept costs during the learning process, cost-sensitive ensembles are designed to guide the cost minimisation procedure through the ensemble learning algorithm (López et al. 2013; Fernández et al. 2018). Galar et al. (2012) present a comprehensive taxonomy of ensemble methods for learning with imbalanced classes in their review. The authors predominantly categorise these ensemble strategies into four distinct families. The first family encompasses cost-sensitive boosting methods, while the remaining three families incorporate data preprocessing techniques and are further classified based on the ensemble learning algorithm employed, namely boosting, bagging, and hybrid ensembles.
Strategy | Strengths | Studies | Weaknesses | Studies
---|---|---|---|---
Data-level strategies (oversampling) | Easy implementation; Versatile (independent of the algorithm); Straightforward | Kaur et al. (2019) | Potential overfitting; Longer training time |
Data-level strategies (undersampling) | Flexible | | Loss of information |
Algorithm-level strategies | Do not cause any shifts in the data distribution; More directed alleviation of the imbalance problem; Less likely to impact training time | Fernández et al. (2018); Johnson and Khoshgoftaar (2019) | Require a deep understanding of the algorithm; Reduced flexibility; More difficult to design and implement than data-level strategies | Fernández et al. (2018)
CSL | Computationally efficient; Preserves the data distribution | Kaur et al. (2019) | Misclassification costs are unknown; Risk of overfitting when searching for optimal costs |
Ensemble-based strategies | Multiple classifiers provide better prediction than a single classifier; More resilience to noise (decreased variance); Improved generalizability | Elrahman and Abraham (2013) | The output model can be difficult to interpret; Computational complexity | López et al. (2013)
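As a minimal illustration of the data-level strategies above, the following sketch balances a toy dataset by randomly replicating minority-class instances. The function name and data are purely illustrative; practical pipelines typically rely on dedicated libraries (e.g. imbalanced-learn's SMOTE or random samplers) rather than hand-rolled code.

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate minority-class instances until all classes reach the
    size of the largest class (naive random oversampling)."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        resampled = rows + [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(resampled)
        y_out.extend([label] * len(resampled))
    return X_out, y_out

X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 0, 0, 1]  # 5:1 imbalance
Xb, yb = random_oversample(X, y)
print(sum(1 for t in yb if t == 0), sum(1 for t in yb if t == 1))  # 5 5
```

Undersampling is the mirror image (discarding majority instances down to the minority size), with the information-loss risk noted in the table above.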
2.3 CSL
2.3.1 Overview
|  | Actual negative | Actual positive |
---|---|---|
Predicted negative | C(0,0) | \(C_{p}\)=C(0,1) |
Predicted positive | \(C_{n}\)=C(1,0) | C(1,1) |
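With the notation of the cost matrix above and zero costs for correct predictions, the cost-minimising decision rule reduces to a probability threshold: predict positive whenever \(p \cdot C_{p} > (1-p) \cdot C_{n}\), i.e. \(p > C_{n}/(C_{n}+C_{p})\). The sketch below is an illustrative derivation, not code from any selected study:

```python
def cost_threshold(cost_n, cost_p):
    """Probability threshold minimising expected cost for a 2x2 cost
    matrix with zero cost for correct predictions:
    predict positive when p > C_n / (C_n + C_p)."""
    return cost_n / (cost_n + cost_p)

def predict(p, cost_n, cost_p):
    """Cost-sensitive decision for an estimated positive-class probability p."""
    return int(p > cost_threshold(cost_n, cost_p))

# E.g. with C_p = 15 (false negative) and C_n = 1 (false positive),
# the usual 0.5 threshold drops to 1/16.
print(cost_threshold(1, 15))   # 0.0625
print(predict(0.10, 1, 15))    # 1: flagged despite a low probability
```

This makes explicit why cost-sensitive models trade precision for sensitivity when false negatives are expensive, as in the illustrative example that follows.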
2.3.2 Illustrative example
|  | Total instances | Cancer instances | Non-cancer instances |
---|---|---|---|
Training set | 686 | 44 | 642 |
Test set | 172 | 11 | 161 |
|  | Actual non-cancer | Actual cancer |
---|---|---|
Predicted non-cancer | 0 | 15 |
Predicted cancer | 1 | 0 |
Metric | Cost-insensitive model | Cost-sensitive model |
---|---|---|
Accuracy | 93.6% | 94.8% |
Precision | 50% | 56.2% |
Sensitivity | 54.5% | 81.8% |
F1 score | 52.2% | 66.6% |
AUC | 89.2% | 95.4% |
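One simple way to act on a cost matrix like the one above is to convert the class-level costs into per-instance weights handed to the learner (most libraries accept such a vector, e.g. as a `sample_weight` argument). A hypothetical sketch using the costs from the example (15 for a missed cancer, 1 for a false alarm):

```python
def instance_weights(y, cost_fn=15.0, cost_fp=1.0):
    """Per-instance weights derived from class-level misclassification
    costs: cancer (positive, label 1) instances carry the false-negative
    cost, non-cancer instances the false-positive cost."""
    return [cost_fn if label == 1 else cost_fp for label in y]

# 44 cancer and 642 non-cancer training instances, as in the example
y_train = [1] * 44 + [0] * 642
w = instance_weights(y_train)
print(w[0], w[-1])  # 15.0 1.0
```

Weighting each minority instance 15 times more heavily pushes the learner towards the higher sensitivity reported for the cost-sensitive model in the table above.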
3 Methodology
3.1 Research questions
ID | RQ | Rationale |
---|---|---|
RQ1 | In which years, publication channels, and sources were the selected papers published? | To explore the historical publication trends and identify the different channels and sources in which the selected papers were published |
RQ2 | What types of research were published? | To determine the research types presented in the selected studies |
RQ3 | Which empirical methods are used to evaluate cost-sensitive models in medicine? | To examine the types of empirical validation performed to evaluate cost-sensitive models in medicine |
RQ4 | In which disciplines of medicine was CSL mainly employed? | To determine the medical disciplines in which CSL was applied |
RQ5 | Which medical tasks are addressed in the selected papers? | To identify the medical tasks for which CSL was used |
RQ6 | Which CSL approaches were most frequently used in medicine? | To identify the most frequent CSL approaches in the medical literature |
RQ7 | What are the strengths and weaknesses of cost-sensitive methods in medicine? | To point out the strengths and limitations of cost-sensitive techniques in medicine |
RQ8 | What are the frequently used medical datasets, data types, and metrics to assess the performance of cost-sensitive models? | To determine the most employed datasets, data types, and metrics to assess the performance of cost-sensitive models in medicine |
RQ9 | Which development tools are used for cost-sensitive techniques’ implementation? | To identify the development tools employed to implement cost-sensitive techniques |
3.2 Search strategy
Scope | Search terms |
---|---|
Medicine | Health* OR Medic* OR Disease OR Clinic* |
AND Artificial Intelligence | “Machine Learning” OR “Deep Learning” OR Intelligen* OR Classif* OR Predict* OR Diagnos* OR Prognos* |
AND Technique | Technique OR Method OR Tool OR Model OR Algorithm OR Approach OR Framework |
AND CSL | “Cost sensitive” OR Cost-sensitive OR “weighted cost function” OR “weighted loss function” OR “class weighting” OR re-weighting |
AND Imbalance | Imbalance* OR unbalance* OR “skewed class distribution” OR under-represented OR “majority class” OR “minority class” |
3.3 Study selection
Inclusion criteria | Exclusion criteria |
---|---|
IC1: Studies developing new or using existing cost-sensitive techniques in medicine | EC1: Papers published earlier than January 2010 or later than December 2022 |
IC2: Papers focusing mainly on cost-sensitive models in medicine, whether or not comparing them to other balancing techniques | EC2: Papers using several datasets from multiple areas with a mere presence of medical ones |
IC3: Papers presenting fair comparisons of several balancing techniques in medicine, including cost-sensitive methods | EC3: Papers using cost-sensitive techniques in public health, biology, pharmacology, or genomics |
IC4: Papers presenting comparisons between CSL methods in medicine without proposing any newly developed techniques | EC4: Papers available as abstracts, posters, book chapters (excluded due to potential duplication with previously published conference or journal papers), or presentations |
IC5: Papers providing an overview of studies investigating cost-sensitive methods in medicine | EC5: Non-peer-reviewed papers |
IC6: Papers combining cost-sensitive methods with other balancing techniques in medicine | EC6: Duplicate publications of the same study |
 | EC7: Studies published in languages other than English
 | EC8: Short papers
 | EC9: Papers for which the full texts are not available
3.4 Quality assessment
ID | Questions | Possible answers and scoring |
---|---|---|
QA1 | Does the study give clear empirical results? | Yes (+1), No (+0) |
QA2 | Does the study give a justified empirical design? | Yes (+1), No (+0), Partially (+0.5) |
QA3 | Does the study evaluate the performance of the developed solution? | Yes (+1), No (+0), Partially (+0.5) |
QA4 | Is the proposed solution in the study compared to other solutions? | Yes (+1), No (+0) |
QA5 | Does the study explicitly present the proposed method’s benefits and limitations? | Yes (+1), No (+0), Partially (+0.5) |
QA6 | Is the study published in a recognised source? | For conferences and workshops (CORE2021): A/A* (+1.5), B (+1), C (+0.5), No Rank (+0); for journals (JCR2021): Q1 (+2), Q2 (+1.5), Q3 (+1), No Rank (+0) |
3.5 Data extraction strategy and synthesis
Extracted data | Description
---|---
Study identifier | -
Title | -
Publication year | -
Authors | -
Abstract | -
Digital library | -
RQ1: In which years, publication channels, and sources were the selected papers published? | Publication years, channels (journal, conference, or workshop), and sources were extracted to address this question.
RQ2: What types of research were published? | The research types were categorised as follows: evaluation research, validation research, solution proposal, review, and others (philosophical papers, opinion papers, and experience papers) (Petersen et al. 2015).
RQ3: Which empirical methods are used to evaluate cost-sensitive models in medicine? | The empirical methods can be classified as historical-based evaluation, case study, or survey (Petersen et al. 2015).
RQ4: In which disciplines of medicine was CSL mainly employed? | Each paper was examined to determine its specific medical focus, encompassing disciplines such as oncology, cardiology, ophthalmology, and others, as detailed in (Careers in medicine 2023).
RQ5: Which medical tasks are addressed in the selected papers? | The medical tasks can be classified into screening, diagnosis, prognosis, treatment, monitoring, and management (Esfandiari et al. 2014).
RQ6: Which CSL approaches were most frequently used in medicine? | The cost-sensitive methods developed in the selected studies were identified. These methods can be classified as either direct or meta-learning approaches; the latter can be further divided into preprocessing and postprocessing methods (Fernández et al. 2018).
RQ7: What are the strengths and weaknesses of cost-sensitive methods in medicine? | The strengths and weaknesses of CSL, CSL approaches, and selected works were outlined.
RQ8: What are the frequently used medical datasets, data types, and metrics to assess the performance of cost-sensitive models? | The frequently used medical datasets, data types (numeric, categorical, time series, images, or text), and evaluation metrics were retrieved.
RQ9: Which development tools are used for cost-sensitive techniques’ implementation? | The reported development tools (programming language, package, or software) were identified.
3.6 Study selection results
4 Statistical trends
4.1 Publication trends
Journal source | #Papers | Percentage |
---|---|---|
Computer Methods and Programs in Biomedicine | 9 | 5.2% |
Computers in Biology and Medicine | 8 | 4.6% |
BMC Medical Informatics and Decision Making | 5 | 2.9% |
Neurocomputing | 5 | 2.9% |
Multimedia Tools and Applications | 5 | 2.9% |
Medical Image Analysis | 4 | 2.3% |
Biomedical Signal Processing and Control | 4 | 2.3% |
Artificial Intelligence in Medicine | 3 | 1.7% |
Applied Soft Computing | 3 | 1.7% |
Other | 75 | 43.4% |
Conference source | #Papers | Percentage |
---|---|---|
International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) | 5 | 2.9% |
Other | 42 | 24.3% |
Workshop source | #Papers | Percentage |
---|---|---|
International Workshop on Machine Learning in Medical Imaging (MLMI) | 3 | 1.7% |
Other | 2 | 1.2% |
4.2 Research types
4.3 Empirical types
4.4 Medical disciplines
4.5 Medical tasks
5 CSL approaches
5.1 Overview
5.1.1 Direct approaches
5.1.2 Instance weighting
5.1.3 MetaCost
5.1.4 Thresholding
5.2 The distribution of CSL approaches in the selected studies
6 Strengths and weaknesses
6.1 Strengths and weaknesses of CSL
Strengths | Studies | Weaknesses | Studies
---|---|---|---
Mitigates the class imbalance efficiently | | Misclassification cost values are unknown | Aldraimli et al. (2021); Liu et al. (2021); Naceur et al. (2020); Afzal et al. (2013); Zhao et al. (2018); Siddiqui et al. (2020); Fernando and Tsokos (2022); Nunes et al. (2013); Naceur et al. (2019); Kumar and Thakur (2021); Zhao et al. (2022); Cao et al. (2013b); Lili et al. (2016); Ravi et al. (2022)
Takes into account the unequal misclassification costs in cost-sensitive problems | | Risk of overfitting the under-represented classes |
Effective when dealing with severely class-imbalanced scenarios | | |
Does not alter the data distribution | Fan et al. (2022); Yang et al. (2021); Wang et al. (2013); Zubair and Yoon (2022); Sadeghi et al. (2022); Wang and Cheng (2021); Nunes et al. (2013); Jiang et al. (2017); Raj et al. (2021); Chamseddine et al. (2022); Mienye and Sun (2021); Zeng et al. (2021); Sheng et al. (2021); Cazañas-Gordón et al. (2022); Castro et al. (2020) | |
Computationally efficient | | |
6.2 Strengths and weaknesses of CSL approaches
Approach | Strengths | Studies | Weaknesses | Studies
---|---|---|---|---
Direct | Many available ML libraries | Sterner et al. (2021) | Require a deep understanding of the underlying learning algorithms; Reduced versatility | Liu et al. (2021)
Preprocessing (Weighting) | Simple; Flexible; Do not modify the learning algorithm | Kaur et al. (2019) | – | –
Preprocessing (MetaCost) | Flexible; Do not modify the learning algorithm | Fernández et al. (2018) | Additional computational steps during the training phase |
Postprocessing (Thresholding) | Flexible; Do not modify the learning algorithm | Liu et al. (2021) | Creating a division between training and cost-sensitive evaluation; The number of thresholds that need to be tuned is usually no less than the number of considered labels | Fernández et al. (2018); Liu et al. (2021)
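The MetaCost row above can be made concrete with the method's core step: relabelling each training instance with the class that minimises expected cost under estimated class probabilities, before retraining any cost-insensitive learner on the relabelled data. The sketch below shows only that relabelling step (in full MetaCost the probabilities come from a bagged ensemble; the cost matrix and probabilities here are illustrative):

```python
def metacost_relabel(probs, cost):
    """Relabel instances with the class minimising expected cost.

    probs: list of per-instance class-probability vectors.
    cost:  cost[i][j] = cost of predicting class i when the actual is j.
    """
    relabelled = []
    for p in probs:
        expected = [sum(p[j] * cost[i][j] for j in range(len(p)))
                    for i in range(len(cost))]
        relabelled.append(min(range(len(expected)), key=expected.__getitem__))
    return relabelled

cost = [[0, 15], [1, 0]]              # false negatives 15x costlier
probs = [[0.9, 0.1], [0.98, 0.02]]    # estimated P(class) per instance
print(metacost_relabel(probs, cost))  # → [1, 0]
```

The first instance is relabelled positive even though its estimated positive probability is only 0.1, because the expected cost of predicting negative (0.1 × 15 = 1.5) exceeds that of predicting positive (0.9 × 1).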
6.3 Strengths and weaknesses highlighted in certain selected works
Study | Task | Data type | Proposed method | CSL technique | Cost values | Advantages | Limitations |
---|---|---|---|---|---|---|---|
Zhenya and Zhang (2021) | Heart disease diagnosis | Numeric, Categorical | Weighted Voting Ensemble (Random Forest, Logistic regression, SVM, ELM and KNN) | Weighting individual classifiers | Determined based on financial costs | Better results compared to single classifiers and previous studies; The limitations of a particular classifier are remedied by other classifiers; Good generalization ability; Closer to reality by considering misclassification costs | Longer training time; State-of-the-art techniques such as DL and soft computing, which would improve the performance, are not included
Barot and Jethva (2021a) | Breast cancer diagnosis | Numeric, Categorical | Decision trees | Integrating costs into Gini index calculation | - | Better results compared to previous studies; Reduces misclassification costs; More balanced performance for both classes | -
Razzaghi et al. (2016) | Patient financial risk prediction; Patient vaccination prediction following a reminder | - | Multilevel SVM | Weighting the regularization parameter C | Selected as inversely proportional to the size of each class | Improved results; Reduced computational time; Robust | -
Liu et al. (2018) | Breast cancer diagnosis | Numeric | SVM | Weighting the regularization parameter C | Quantified based on misclassification consequences | Improved performance; Increased specificity | Decreased sensitivity (but a better overall performance)
Uguroglu et al. (2012) | Heart disease diagnosis | Numeric, Categorical, Time series (ECG records) | KNN | Weighting the neighbours’ votes | \(C_{n}=1\) and \(C_{p}=\frac{K}{2}+1\), where K is the number of neighbours | Better results compared to state-of-the-art methods and other ML models; Can apply to a larger population; High AUC scores using the least invasive, least costly and least risky tests | -
Lee et al. (2023) | Sleep stage classification | Time series (ECG records) | A DL architecture integrating a CNN, a Bidirectional Long Short-Term Memory (Bi-LSTM), and a Soft Voting-based Ensemble | Focal loss | - | Better performance than state-of-the-art methods; Solves the class imbalance problem; Reduces the training time; Avoids overfitting | May not work well for subjects with sleep disorders that have different sleep structures than healthy subjects; No external validation was performed; Potential delays if applied in online and real-time applications
Holste et al. (2022) | Thorax disease diagnosis | Images (Chest X-ray) | A deep CNN-based model (ResNet50) | Weighted loss functions: weighted Cross-Entropy (CE) loss, Focal loss, Label-Distribution-Aware Margin loss, Influence-Balanced loss | - | Improved performance for infrequent classes | Which re-weighting method provides more significant gains appears to depend on its interaction with the loss function used
Ashfaq et al. (2019) | Hospital readmission prediction of congestive heart failure patients | Numeric, Categorical | LSTM | Weighted CE loss | Determined based on the IR | Enhanced performance; Fast training time; Increased sensitivity; Cost savings | Train and test data are from a single region
Wang et al. (2018a) | Hospital readmission prediction | Time series (Numeric, Categorical) | A DL model integrating a CNN and a multilayer perceptron | Weighted CE loss | Determined based on the IR | Better results compared to state-of-the-art models; Deployed in a real system for readmission prediction | Moderate sensitivity
Fernando and Tsokos (2022) | Skin lesion diagnosis | Images (Dermoscopy images) | A deep CNN-based model (EfficientNet) | Dynamically weighted balanced loss function (composed of two terms: dynamically weighted CE and a regularization component equal to the entropy of the Brier score) | \(C_{i}=log(\frac{n_{m}}{n_{i}})+1\), where \(n_{m}\) is the frequency of the majority class and \(n_{i}\) is the frequency of class i | Better results compared to CE loss, weighted CE loss, and Focal loss; Dynamic weighting (self-adapts its weights depending on the prediction scores) with an emphasis on hard-to-train examples; Robust generalization; Broad applicability (medicine and intrusion detection applications) | -
Javidi et al. (2021) | COVID-19 early detection | Images (CT scans) | A deep CNN-based model (hybrid DenseNet and CapsNet) | Weighted loss function | \(C_{p} = 1-\frac{n_{p}}{N}\) and \(C_{n}=1-\frac{n_{n}}{N}\), where \(n_{p}\) and \(n_{n}\) are the frequencies of the positive and negative classes and N is the total number of samples | Improved performance; Robust (even if the positive samples are 50 times fewer than the negative samples); Stable; Fast convergence; Can process large images even with a small number of training data | The output does not contain any explicit segmentation of diagnostically helpful components
Liu et al. (2021) | Cardiovascular diseases screening | Time series (ECG records) | A deep CNN-based model (ResNet) | Thresholding | Determined by domain experts | Better results compared to other commonly used thresholding methods (rank-based, proportion-based, and fixed thresholding); Ranked among the top 10 teams in the PhysioNet/CinC challenge | Potential loss of cost information (the cost information in the cost matrix is converted to the costs for binary classification); Unreasonable predictions cannot be avoided for this multi-label ECG classification task (two labels can be predicted to coexist in a recording); Lack of interpretability
Zhang and Shen (2011) | Alzheimer’s disease diagnosis | Images (MRI, PET), Numeric | SVM | Thresholding | Fixed costs | Improved performance; Multi-stage cost-sensitive model (integrating cost-sensitivity at the feature selection and classification stages) | -
Zhao et al. (2018) | Medical incidents detection due to look-alike sound-alike (LASA) mix-ups | Text | Logistic regression | Thresholding | Testing multiple values | Improved performance | Outperformed by resampling (perhaps due to the uncertainty and inconsistency of the cost matrix in training and testing the dataset)
Cao et al. (2013a) | Lung nodule detection | Images (CT scans) | Adaptive Random Subspace Ensemble | Thresholding | Determined using a heuristic search strategy with G-mean as the fitness function | Better results compared to resampling and AdaCost; Improved generalization | -
Reychav et al. (2019) | Cardiac patient survival prediction in emergency situations | Numeric, Categorical | Logistic regression | Thresholding | Testing multiple values | Improved performance | The optimal proportions of positive and negative samples in the training data were not computed when dividing the data into train/test; The data (Israel) might not be generalizable to other areas; Needs to be further validated and tested with other datasets
Shen et al. (2022) | Sleep apnea detection | Time series (PPG signals) | A DL model integrating a deep CNN (multi-attention ResNet) and AdaCost | Weighting | Determined based on the IR | Better results compared to state-of-the-art models; Effectively reduces the gap between specificity and sensitivity; Lower running time; Real-time detection | Sensitivity could be further enhanced; Complex structure and numerous parameters
Hsu et al. (2015) | Breast cancer risk assessment | Numeric, Categorical | Decision trees, Logistic model tree, Naïve Bayes, SVM, KNN, Radial basis function network | Weighting | Testing multiple values | Better performance than sampling and ensemble learning; Perfect recall score (100%) | Only reasonable precision
Li et al. (2021) | Brain tumour classification; Lung cancer staging | Images (MRI, CT scans) | A 3D Siamese network (self-supervised) | Weighting | \(C_{p}=\frac{N}{n_{p}}\) and \(C_{n}=\frac{N}{n_{n}}\) | Better results; A large boost in predicting the minor class; Successfully tackles class imbalance | -
Henze et al. (2021) | Seizure detection | Time series (ECG records) | Naïve Bayes, KNN, SVM, Adaboost | Weighting | Determined based on the IR | High sensitivity; Lower detection latency | High false alarm rate; Data quality (improvements in the technical setting might lead to better availability and quality of the heart rate data)
Zhang et al. (2018) | Breast cancer diagnosis; Liver disease diagnosis | Numeric, Categorical | Hierarchical ELM | Weighting | Determined based on the IR | Effectively solves the class imbalance problem with small biomedical datasets; Higher and more stable performance than other state-of-the-art methods; Enhanced generalization | -
Wang et al. (2013) | Survivability prognosis of breast cancer | Numeric, Categorical | Logistic regression, Decision trees | Weighting | Determined based on the IR | Better results compared to resampling and ensemble learning; Higher predictive performance | -
Sung et al. (2021) | Acute stroke diagnosis | Numeric, Categorical | Random forest, SVM, Logistic regression, Decision trees, KNN | Weighting | - | Enhanced performance | Outperformed by resampling; Generalizability needs further examination (single-site study); Prehospital factors (such as mode of transportation to the hospital and diagnosis by emergency medical services) were not considered
7 Datasets and data types
7.1 Datasets
Dataset | Source | Data type | #Instances | #Attributes | #Classes | #Papers |
---|---|---|---|---|---|---|
MIT-BIH Arrhythmia | Moody and Mark (1980) | Time series (ECG records) | 48 | - | 5 | 7 |
COVID-19 Chest X-ray | Cohen et al. (2020a) | Images (Chest X-ray), Numeric and Categorical (metadata) | 123 | 16 | 5 | 6 |
Wisconsin (Diagnostic) | Wolberg et al. (1995) | Numeric | 569 | 30 | 2 | 6 |
Pima Indians Diabetes | National Institute of Diabetes and Digestive and Kidney Diseases (1990) | Numeric | 768 | 8 | 2 | 6 |
ISIC 2019 | ISIC Challenge (2019) | Images (Dermoscopy) | 25331 | - | 8 | 5 |
Thyroid Disease | Quinlan (1987) | Numeric, Categorical | 3772 | 21 | 3 | 4 |
HAM10000 | Tschandl (2018) | Images (Dermoscopy), Numeric and Categorical (metadata) | 10015 | 5 | 7 | 4 |
Indian Liver Patient (ILPD) | Ramana and Venkateswarlu (2012) | Numeric, Categorical | 583 | 10 | 2 | 4 |
BUPA Liver Disorders | UCI Machine Learning Repository (1990) | Numeric, Categorical | 345 | 5 | 2 | 4 |
7.2 Data types
8 Performance metrics
8.1 Traditional metrics
|  | Actual negative | Actual positive |
---|---|---|
Predicted negative | TN | FN |
Predicted positive | FP | TP |
- Accuracy is a standard evaluation measure in ML used to assess a model’s ability to predict class labels accurately. It is defined as the ratio of correct predictions to the total number of predictions made:
$$\begin{aligned} Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \end{aligned}$$  (9)
Relying solely on accuracy may not be appropriate for imbalanced datasets, as it can produce misleading results: a model that appears to perform well may in fact be biased towards the majority class. To avoid such bias, all the selected studies using accuracy, except one (Naseem et al. 2020), utilised complementary metrics to evaluate model performance comprehensively.
- Error rate is the complement of accuracy. It quantifies the percentage of misclassified instances and is calculated as follows:
$$\begin{aligned} Error \ rate = 1 - Accuracy = \frac{FP + FN}{TP + FP + TN + FN} \end{aligned}$$  (10)
- Sensitivity, also called recall or True Positive Rate (TPR), quantifies the proportion of actual positive instances that the model correctly predicts as positive:
$$\begin{aligned} Sensitivity = \frac{TP}{TP + FN} \end{aligned}$$  (11)
A high sensitivity value is particularly desirable in medical contexts, as it reduces the risk of FN and helps ensure that positive cases are correctly identified. This is crucial in medical diagnosis, where detecting all individuals with the disease (TP) is paramount and a missed diagnosis can lead to delayed treatment and severe health complications.
- Specificity, also known as the True Negative Rate (TNR), measures the proportion of actual negative instances that the model correctly predicts as negative:
$$\begin{aligned} Specificity = \frac{TN}{TN + FP} \end{aligned}$$  (12)
In medical settings, high specificity is critical to reduce the occurrence of FP and ensure that negative cases are correctly identified. FP can lead to unnecessary medical interventions or additional diagnostic procedures, highlighting the critical role of specificity in reliable medical diagnosis.
- Precision is a performance metric that quantifies the accuracy of the positive predictions made by a model. It is computed as follows:
$$\begin{aligned} Precision = \frac{TP}{TP + FP} \end{aligned}$$  (13)
It is worth noting that precision and sensitivity demonstrate an inverse correlation, whereby improving one metric often leads to a decline in the other. When dealing with imbalanced medical data, prioritising sensitivity at the expense of precision may increase FP, leading to unwarranted medical interventions or additional tests. Precision is therefore a crucial metric for evaluating a model that seeks to minimise the number of FP while maximising TP.
- The AUC metric quantifies a model’s ability to discern between positive and negative cases, making it a compelling choice for medical applications. It ranges between 0 and 1, with a higher value indicating better overall performance. The AUC is computed as the area under the Receiver Operating Characteristic (ROC) curve, which plots the model’s TPR against the False Positive Rate (FPR) at varying threshold levels. The FPR is the complement of specificity:
$$\begin{aligned} FPR = 1 - Specificity \end{aligned}$$  (14)
The ROC curve visually illustrates the trade-off between sensitivity and specificity across different classification thresholds. Frequently paired with the AUC metric, it facilitates visual comparison and evaluation of different models’ performances. In medical research and decision-making, the ROC curve and AUC metric aid in selecting an optimal threshold that balances sensitivity and specificity according to the specific requirements of the medical task at hand. An ideal classifier would be positioned in the top-left corner of the plot, representing a perfect balance between sensitivity and specificity; the closer a model’s ROC curve approaches this ideal point, the better its performance.
- The Geometric Mean (G-mean) is another performance metric that comprehensively evaluates a model’s accuracy by combining sensitivity and specificity, and it is commonly employed in scenarios involving imbalanced datasets. It is defined as the geometric mean of sensitivity and specificity:
$$\begin{aligned} G\text {-}mean = \sqrt{Sensitivity \cdot Specificity} \end{aligned}$$  (15)
By considering both sensitivity and specificity, the G-mean offers a balanced assessment of a classifier’s performance on the minority and majority classes. It provides a reliable measure of accuracy, accounting for the occurrence of FP and FN, and is particularly valuable for medical datasets where the costs associated with FP and FN can vary significantly.
- The balanced accuracy metric offers a comprehensive evaluation of a classifier’s accuracy on both the positive and negative classes, taking into account both sensitivity and specificity. It is calculated as the average of the two:
$$\begin{aligned} Balanced \ Accuracy = \frac{Sensitivity + Specificity}{2} \end{aligned}$$  (16)
By weighting the performance on both classes equally, balanced accuracy addresses the potential bias towards the majority class in the traditional accuracy measure. This makes it particularly suitable for evaluating classifiers on imbalanced datasets and enhances its clinical relevance as an evaluation criterion.
- F1 score, also known as the F-measure, combines precision and sensitivity to assess the overall effectiveness of a classifier. It provides a balanced evaluation by considering the model’s ability to correctly identify positive instances (precision) and to capture all positive instances (sensitivity). The F1 score is calculated as the harmonic mean of precision and sensitivity, ensuring that both measures are weighted equally:
$$\begin{aligned} F1 \ score = \frac{2 \cdot (Precision \cdot Sensitivity)}{Precision + Sensitivity} \end{aligned}$$  (17)
The F1 score finds particular utility in scenarios where both precision and sensitivity are important, such as medical diagnosis. Moreover, the F-measure encompasses a range of metrics beyond the F1 score. These metrics, collectively called F\(_{\beta }\) scores, introduce a parameter \(\beta\) that allows flexible weighting of precision and sensitivity based on specific application requirements:
$$\begin{aligned} F_{\beta } \ score = \frac{(1 + \beta ^2) \cdot (Precision \cdot Sensitivity)}{(\beta ^2 \cdot Precision) + Sensitivity} \end{aligned}$$  (18)
The \(\beta\) parameter controls the relative emphasis placed on precision versus sensitivity. A higher \(\beta\) value (e.g., the F2 score) favours sensitivity over precision, making it suitable when the cost of FN is significant. Conversely, a lower \(\beta\) value (e.g., the F0.5 score) emphasises precision, making it appropriate when the cost of FP is more critical.
-
The Area Under the Precision-Recall Curve (AUPRC) serves as a comprehensive measure of a classifier’s overall effectiveness in capturing positive instances across different classification thresholds. In contrast to the ROC curve, which considers the trade-off between sensitivity and specificity, the Precision-Recall (PR) curve focuses on the trade-off between precision and sensitivity (recall). The PR curve plots precision values against corresponding sensitivity values at various thresholds.

The AUPRC is computed as the area under the PR curve. Spanning the interval of 0 to 1, a higher AUPRC value reflects superior performance, indicating that the classifier achieves high precision while maintaining a high sensitivity rate. This implies that the classifier accurately identifies positive instances while minimising FP. The AUPRC metric is especially beneficial with datasets exhibiting significant class imbalance or when the consequences of FN and FP differ, as commonly seen in medical applications.
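A common way to estimate the AUPRC is the step-wise average-precision sum \(AP = \sum _n (R_n - R_{n-1}) \cdot P_n\). The sketch below implements that estimator from scratch (the function name is ours; scikit-learn's average_precision_score uses the same estimator):

```python
def average_precision(y_true, scores):
    """Step-wise AUPRC estimate: precision at each retrieved positive,
    weighted by the gain in recall."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])  # rank by score
    total_pos = sum(y_true)
    tp, ap, prev_recall = 0, 0.0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:  # a positive instance retrieved at this rank
            tp += 1
            precision, recall = tp / rank, tp / total_pos
            ap += (recall - prev_recall) * precision
            prev_recall = recall
    return ap

# A ranking that places all positives first achieves the maximum of 1.0:
perfect = average_precision([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
```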
-
The Matthews Correlation Coefficient (MCC) metric measures the quality of binary classifiers, taking into account TP, TN, FP, and FN. MCC is calculated using the following formula:

$$\begin{aligned} MCC = \frac{(TP \cdot TN) - (FP \cdot FN)}{\sqrt{(TP + FP) \cdot (TP + FN) \cdot (TN + FP) \cdot (TN + FN)}} \end{aligned}$$ (19)

MCC ranges from -1 to +1, where a score of 1 indicates a perfect prediction, 0 represents a random prediction, and -1 indicates a complete disagreement between the prediction and the actual label. MCC is commonly used in fields such as bioinformatics, where imbalanced datasets and binary classification problems are prevalent (Chicco and Jurman 2020). It is considered a robust statistical measure (Sadeghi et al. 2022) as it yields a high score only when the predictions exhibit strong performance across all four categories of the confusion matrix (TP, FP, TN, and FN).
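Equation (19) in code, as a minimal sketch (the function name is ours; returning 0 when the denominator vanishes is a common convention, not part of Eq. (19) itself):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient per Eq. (19)."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

best = mcc(tp=50, tn=50, fp=0, fn=0)    # perfect prediction → 1.0
worst = mcc(tp=0, tn=0, fp=50, fn=50)   # fully inverted prediction → -1.0
```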
-
The Kappa score, also known as Cohen’s Kappa, is a statistical measure that assesses the level of agreement between two annotators or raters in categorical classification tasks. It considers both the accuracy of the classifier and the possibility of agreement occurring by chance. The Kappa score is computed via the subsequent formula, where \({p_0}\) is the observed agreement or accuracy (the proportion of instances where the classifier and the actual labels agree) and \({p_e}\) is the expected agreement (the agreement expected by chance alone), calculated from the marginal probabilities of the classifier’s predictions and the true labels:

$$\begin{aligned} Kappa = \frac{p_0 - p_e}{1 - p_e} = 1 - \frac{1 - p_0}{1 - p_e} \end{aligned}$$ (20)

The Kappa score ranges from -1 to 1, with higher values indicating a higher level of agreement between the classifier’s predictions and the true labels. A score of 1 represents a perfect agreement beyond chance, 0 indicates agreement equivalent to chance, and negative values indicate less agreement than expected by chance.

The Kappa score is instrumental in situations with a class imbalance or where relying solely on accuracy can be misleading. It serves as a useful metric for assessing the consistency and reliability of categorical classifications, providing insights into the quality of annotations or the performance of classifiers compared to human annotators.
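For the binary case, Eq. (20) can be sketched as follows (the function name is ours; scikit-learn's cohen_kappa_score provides a general multi-class implementation):

```python
def cohen_kappa(tp, fn, fp, tn):
    """Cohen's kappa per Eq. (20) for a binary confusion matrix."""
    n = tp + fn + fp + tn
    p0 = (tp + tn) / n  # observed agreement
    # expected agreement from the marginals of predictions and true labels
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (p0 - pe) / (1 - pe)

# Here p0 = 0.7 and pe = 0.5, so kappa = (0.7 - 0.5) / (1 - 0.5) = 0.4
kappa = cohen_kappa(tp=40, fn=10, fp=20, tn=30)
```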
8.2 Cost-related metrics
-
The Misclassification Cost (MC) metric, sometimes called average cost (Guido et al. 2022), provides a comprehensive evaluation of a classifier’s performance by considering the potential costs associated with misclassifying instances in a classification task. Unlike traditional accuracy, which treats all misclassifications equally, the MC metric assigns specific costs to different types of errors based on their impact or significance in a given application. This performance measure enables a more informed evaluation in cost-sensitive applications, as is the case for medical ones. The MC metric is calculated as follows:

$$\begin{aligned} MC = \frac{(FP \cdot C_p) + (FN \cdot C_n)}{TP + TN + FP + FN} \end{aligned}$$ (21)
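Equation (21) as a minimal sketch (the function name is ours), following the equation's pairing of FP with \(C_p\) and FN with \(C_n\):

```python
def misclassification_cost(tp, tn, fp, fn, c_p, c_n):
    """Average misclassification cost per Eq. (21)."""
    return (fp * c_p + fn * c_n) / (tp + tn + fp + fn)

# 10 FP at cost 1 and 5 FN at cost 4 over 100 instances → (10 + 20) / 100 = 0.3
mc = misclassification_cost(tp=50, tn=35, fp=10, fn=5, c_p=1.0, c_n=4.0)
```

With equal unit costs (c_p = c_n = 1) the metric reduces to the plain error rate; unequal costs shift the evaluation towards the error type that matters more in the application.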
-
The cost curve is a graphical depiction representing a binary classifier’s performance (expected cost) over the full range of possible class distributions and misclassification costs (Drummond and Holte 2000, 2006). The y-axis corresponds to the normalised expected cost, where \({P(+)}\) is the probability of an example being from the positive class:

$$\begin{aligned} EC_{\text {norm}} = \frac{FN \cdot P(+)\cdot C_p + FP \cdot (1 - P(+)) \cdot C_n}{P(+) \cdot C_p + (1 - P(+)) \cdot C_n} \end{aligned}$$ (22)

The x-axis corresponds to the “probability times cost”, which summarises misclassification costs and class distributions in a single number:

$$\begin{aligned} P(+)\cdot \text {cost} = \frac{P(+) \cdot C_p}{P(+) \cdot C_p + (1 - P(+)) \cdot C_n} \end{aligned}$$ (23)

Drummond and Holte (2000) introduced cost curves as a remedy to address the limitations of ROC curves. Cost curves offer a comprehensive evaluation of classifier performance by considering specific misclassification costs, class probabilities, performance comparisons between different classifiers, average performance across multiple evaluations, confidence intervals, and statistical significance of performance differences, making them a powerful tool for decision-making in classification tasks. In later work, Drummond and Holte (2006) provide an illustrative example that complements their prior research, showcasing the cost lines associated with C4.5 decision trees and 1R models on the Japanese credit dataset, where costs are taken into consideration.
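Equations (22) and (23) can be sketched as below (function names are ours). Following Drummond and Holte's formulation, FN and FP are interpreted here as error rates (FNR and FPR), which keeps the expected cost normalised to [0, 1]:

```python
def pc_plus(p_pos, c_p, c_n):
    """Probability-times-cost, the x-coordinate of the cost curve (Eq. 23)."""
    return p_pos * c_p / (p_pos * c_p + (1 - p_pos) * c_n)

def norm_expected_cost(fnr, fpr, p_pos, c_p, c_n):
    """Normalised expected cost (Eq. 22), with FN/FP taken as error rates."""
    numerator = fnr * p_pos * c_p + fpr * (1 - p_pos) * c_n
    denominator = p_pos * c_p + (1 - p_pos) * c_n
    return numerator / denominator

x = pc_plus(0.3, 2.0, 1.0)
ec = norm_expected_cost(fnr=0.2, fpr=0.1, p_pos=0.3, c_p=2.0, c_n=1.0)
```

Substituting Eq. (23) into Eq. (22) gives \(EC_{\text {norm}} = FNR \cdot x + FPR \cdot (1 - x)\): each classifier traces a straight line over the x-axis, which is why cost curves are drawn as cost lines.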
-
The weighted Kappa score (Cohen 1968) is a modified version of the Kappa score, which incorporates weights (the misclassification costs) that reflect the severity or importance of disagreement for each class based on a cost matrix. It can be formulated as follows, where \(w_{ij}\) refers to the weight associated with the value in the ith row and jth column of the confusion matrix, and \(P_{i.}\) and \(P_{.j}\) are the marginal probabilities:

$$\begin{aligned} Kappa_w = \frac{\sum _{i=1}^{I} \sum _{j=1}^{I} w_{ij} \cdot P_{ij} - \sum _{i=1}^{I} \sum _{j=1}^{I} w_{ij} \cdot P_{i.} P_{.j}}{1 - \sum _{i=1}^{I} \sum _{j=1}^{I} w_{ij} \cdot P_{i.} P_{.j}} \end{aligned}$$ (24)
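A direct implementation of Eq. (24), as a minimal sketch (the function name is ours; note that in this form the \(w_{ij}\) score agreement, so an identity weight matrix recovers the plain Kappa score of Eq. (20)):

```python
def weighted_kappa(conf, weights):
    """Weighted kappa per Eq. (24) from a confusion matrix and a weight matrix."""
    n = sum(sum(row) for row in conf)
    k = len(conf)
    p = [[conf[i][j] / n for j in range(k)] for i in range(k)]
    row = [sum(p[i]) for i in range(k)]                        # marginals P_i.
    col = [sum(p[i][j] for i in range(k)) for j in range(k)]   # marginals P_.j
    observed = sum(weights[i][j] * p[i][j] for i in range(k) for j in range(k))
    expected = sum(weights[i][j] * row[i] * col[j]
                   for i in range(k) for j in range(k))
    return (observed - expected) / (1 - expected)

# Identity weights reduce Eq. (24) to the unweighted Kappa:
kappa_w = weighted_kappa([[40, 10], [20, 30]], [[1, 0], [0, 1]])
```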
-
Cost-Weighted Accuracy (CWA), proposed by the PhysioNet/CinC challenge (Alday et al. 2020), is a multi-class scoring metric that extends the traditional accuracy metric by incorporating cost-based weights. To calculate the cost-weighted accuracy, the prediction results are organised in a multi-class confusion matrix denoted by \(A = [a_{ij}]\), where \(a_{ij}\) represents the number of instances from class j classified as class i. The score is derived by performing a weighted averaging of the matrix A, where each entry is multiplied by its corresponding cost-based weight \(w_{ij}\), which represents the cost of misclassifying an instance of class j into class i based on treatment similarities or differences in risks:

$$\begin{aligned} CWA = \sum _{i,j} a_{ij} w_{ij} \end{aligned}$$ (25)

The score is then normalised to range between 0 and 1, where a perfect classifier receives a score of 1 for correctly predicting the true labels, and an inactive classifier gets a score of 0 for always predicting the normal class. The scoring metric fully acknowledges and rewards accurate diagnoses while granting partial credit to misdiagnoses with similar risks or outcomes as the true diagnosis.
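The sketch below illustrates Eq. (25) together with the challenge's normalisation, which rescales the raw weighted sum between the scores of an inactive classifier and a perfect one (the helper names are ours):

```python
def cost_weighted_accuracy(A, W, normal_class):
    """Normalised CWA: the raw Eq. (25) score rescaled so that a perfect
    classifier scores 1 and an always-'normal' classifier scores 0."""
    k = len(A)

    def raw(M):  # Eq. (25): weighted sum over the confusion matrix
        return sum(M[i][j] * W[i][j] for i in range(k) for j in range(k))

    # column sums of A give the number of true instances per class
    col = [sum(A[i][j] for i in range(k)) for j in range(k)]
    perfect = [[col[j] if i == j else 0 for j in range(k)] for i in range(k)]
    inactive = [[col[j] if i == normal_class else 0 for j in range(k)]
                for i in range(k)]
    return (raw(A) - raw(inactive)) / (raw(perfect) - raw(inactive))

# Two classes, identity weights, class 0 taken as the 'normal' class:
cwa = cost_weighted_accuracy([[8, 2], [2, 8]], [[1, 0], [0, 1]], normal_class=0)
```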
8.3 The distribution of performance metrics in the selected studies
9 Development tools
Tool | License | #Papers |
---|---|---|
Python | Open-source | 64 |
Weka | Open-source | 14 |
MATLAB | Proprietary | 15 |
R | Open-source | 9 |
Libsvm | Open-source | 7 |
KEEL | Open-source | 2 |
Caffe | Open-source | 2 |
Java | Open-source | 1 |
RapidMiner | Commercial | 1 |
Python’s ML libraries offer extensive built-in support for CSL. TensorFlow provides the weighted_cross_entropy_with_logits function (TensorFlow 2023a), which allows applying class weights directly to the loss calculation using the pos_weight parameter. Furthermore, Keras offers the class_weight parameter in its models, enabling users to set class weights during training and thereby handle class imbalance within the Keras framework. Scikit-learn also provides the class_weight parameter in various classifiers. Additionally, XGBoost, a popular gradient boosting library in Python, offers the scale_pos_weight parameter: by assigning a higher weight to the minority class, XGBoost ensures balanced learning and improved performance on imbalanced datasets. LightGBM, another robust gradient boosting library in Python, handles class imbalance through its class_weight parameter; by assigning appropriate weights to different classes, LightGBM adjusts the impact of each class during model training. PyTorch users can, for example, employ the compute_class_weight function (Scikit-learn 2023b) provided by Scikit-learn to calculate weights, which can then be incorporated into the weight parameter of the CrossEntropyLoss function (PyTorch 2023b).

MATLAB supports CSL through options such as Cost (MATLAB 2023a), which utilises cost matrices to represent misclassification costs for different classes, and ClassWeights (MATLAB 2023c), which allows users to assign specific weights to each class during training. These features provide flexibility in addressing class imbalance and implementing effective cost-sensitive models. Moreover, MATLAB’s efficient handling of large datasets, high-performance computing capabilities, compatibility with other programming languages, and active user community contribute to its popularity and usability in ML and data analysis. These additional advantages, combined with its support for CSL, solidify MATLAB’s position as a versatile and powerful tool for implementing cost-sensitive models.

Weka offers the meta-classifier CostSensitiveClassifier (Trigg 2023a), which enables users to transform a base classifier in Weka into a cost-sensitive model by incorporating a cost matrix during model training. The cost matrix can be conveniently specified as input, facilitating the automatic handling of class imbalance. Another available meta-classifier is MetaCost (Trigg 2023b).

In R, the Mlr package enables users to implement cost-sensitive models through thresholding and weighting techniques (Bischl et al. 2022). Another package, Caret, provides a unified interface for training and evaluating various ML models, offering CSL support through the train function (Kuhn 2008), which allows users to specify the misclassification cost for each class using the weights argument. Similarly, the Rpart library incorporates the weights argument (R 2022b), allowing users to assign different weights to classes while constructing decision trees. Additionally, the LiblineaR package permits the assignment of higher weights to instances of the minority class using the wi argument during the development of linear models (R 2022a). Moreover, the active community support, flexibility, and seamless integration with complementary data manipulation and visualisation tools further contribute to R’s prominence in CSL research.

LibSVM provides the -wi option, which allows specific weight values to be assigned to each class during model training. Furthermore, LibSVM’s multi-language support and extensive documentation enhance its recognition and adoption in the research community, solidifying its position as a valuable tool for researchers exploring CSL.

10 Limitations
-
Selection bias: Various measures were taken to minimise potential selection bias in this review. A comprehensive search strategy was implemented, incorporating a diverse set of search terms, alternative spellings, and synonyms. The search covered all article fields and was carried out across multiple databases, including PubMed, IEEE Xplore, Springer Link, Science Direct, and Google Scholar. Google Scholar was included explicitly to retrieve papers that may not have been available in the first four libraries. Moreover, the selection criteria were rigorously defined and carefully applied to the candidate papers by one author, while the remaining authors evaluated the final selection; any disagreements between the three authors were resolved through meetings until a consensus was reached. To reduce exclusions, reasonable QA criteria were designed to ensure that papers of sufficient quality were included in the study. In addition, theoretical papers and reviews were assessed using only two non-empirical QA questions to avoid overlooking them. Despite these efforts, some limitations should be acknowledged. It is plausible that some relevant works may have been missed, specifically those published in languages other than English, in other databases, or in non-peer-reviewed sources not encompassed in the search. Additionally, snowballing (i.e. manual searching of reference lists) was not conducted, which could have identified additional relevant studies.
-
Data extraction bias: In light of the critical and time-intensive nature of data extraction, a meticulous approach was taken to mitigate potential bias. One author conducted the task carefully, while the other two authors diligently reviewed the extracted data to ensure its accuracy and impartiality. Despite our efforts, some degree of subjectivity may have been introduced. Regular meetings were held to reconcile divergences and achieve a mutually agreed-upon interpretation of the data to counteract this possibility.
11 Implications for future research
-
Understanding domain-specific imbalance: Imbalanced medical datasets present substantial challenges arising from the inherent characteristics of medical data, where certain conditions exhibit significantly lower prevalence than others. To address these challenges, researchers must cultivate a profound understanding of the specific class imbalance issues within their targeted medical domain. This necessitates a comprehensive examination of the distribution patterns of medical conditions in the dataset, the identification of critical minority classes, and an exploration of the underlying factors contributing to this imbalance. Furthermore, researchers must evaluate the potential consequences of misclassification within their specific medical context. This evaluation entails a thorough consideration of the associated risks, costs, and implications associated with FN and FP.
-
Cost matrix design and evaluation: As CSL relies on the accurate estimation of misclassification costs, researchers should carefully consider the design and evaluation of the cost matrix. Collaborating with domain experts and healthcare professionals is highly valuable to define the costs associated with different types of misclassifications, especially in medical settings where the consequences of FN and FP can differ significantly. Researchers are encouraged to explore methods for cost matrix estimation, including expert opinions, data-driven approaches, and incorporating contextual factors.
-
Combining attribute and misclassification costs: While misclassification costs capture the consequences of FN and FP, attribute costs reflect the challenges associated with acquiring specific features, encompassing aspects such as financial expenses, time constraints, or the invasiveness of required tests (Fernández et al. 2018). By combining these two types of costs, researchers can develop comprehensive cost-sensitive models that simultaneously account for predictive performance and cost-efficiency in feature selection. Striking an optimal balance between the performance achieved by utilising certain features and the costs associated with their acquisition enables the development of more effective and resource-efficient models for medical decision-making.
-
Hybrid CSL: Combining CSL with other balancing strategies presents a promising avenue for addressing class imbalance in medical datasets. By integrating CSL with strategies like resampling or ensemble learning, researchers can leverage the strengths of multiple strategies to handle class imbalance and address the associated misclassification costs effectively. It is crucial, however, to gain a deep understanding of the characteristics and requirements of the specific medical dataset under investigation. This understanding allows for identifying scenarios where CSL or other balancing strategies excel individually and situations where combining them yields better results, ultimately leading to more effective and tailored solutions for imbalanced medical data.
-
Cost-sensitive evaluation: Traditional performance metrics may not fully capture the effectiveness of models when misclassification costs are unequally distributed. Researchers are strongly encouraged to expand the evaluation beyond conventional metrics and employ cost-sensitive metrics that directly incorporate the associated misclassification costs. Additionally, a comprehensive evaluation strategy should combine multiple metrics to gain a holistic understanding of the model’s performance in terms of both classification accuracy and cost-effectiveness.
-
Addressing less investigated medical disciplines and tasks: While disciplines such as oncology, cardiology, neurology, and infectious diseases have garnered significant research attention, other medical disciplines have received relatively less investigation. Similarly, diagnosis has been extensively studied, while other medical tasks remain relatively unexplored. To address this gap, researchers are urged to broaden their focus beyond the well-investigated medical sub-fields and tasks and delve into the untapped potential of less investigated medical domains. This exploration will facilitate a deeper understanding of the applicability and effectiveness of CSL in a broader range of medical applications. Furthermore, promoting data sharing is highly recommended, as limited dataset availability may have contributed to the underrepresentation of certain medical disciplines and tasks in the existing literature.
-
Advancing validation research: The scarcity of papers dedicated to validation research reflects the inherent challenges in conducting assessments of cost-sensitive methods in real-world hospital settings. Therefore, researchers must establish close collaborations with medical professionals and actively engage in validation studies to demonstrate the effectiveness and reliability of CSL methods in real medical scenarios. These validation studies can provide valuable insights into the practical performance of CSL models and enhance the trust and confidence of healthcare practitioners.
-
Ensuring generalizability: Researchers should focus on developing models that can effectively handle class imbalance across diverse datasets and healthcare contexts. This involves evaluating the performance of cost-sensitive methods on multiple datasets, encompassing different medical institutions and patient populations. Furthermore, efforts should be made to address potential sources of dataset bias, covariate shift, and concept drift to enhance the models’ generalizability to unseen data.
-
Considering interpretability: Interpretability is recognised as a critical consideration in developing cost-sensitive solutions. Guaranteeing interpretability within these models is paramount for cultivating transparency and understanding in clinical decision-making processes. Researchers are urged to prioritise the development of interpretable cost-sensitive techniques that strike a balance between model complexity and transparency. This emphasis empowers medical professionals to accurately interpret and trust the predictions made by these models.
-
Enhancing reproducibility: Researchers are encouraged to actively engage in data and code-sharing practices, fostering a collaborative environment that enables the scientific community to reproduce and validate research findings. Furthermore, it is crucial to provide detailed reports on the cost-sensitive methods employed, covering the specific cost matrix used, the chosen cost-sensitive approach, and the algorithmic configurations. Comprehensive reporting promotes transparency and facilitates comparisons between different CSL methods.
-
Broader applicability: Researchers should extend their focus beyond single-label classification and segmentation tasks and dedicate more attention to multi-label (Tarekegn et al. 2021) and regression (Wang et al. 2020a) problems. While single-label classification and segmentation have received significant attention, there is a need for comprehensive investigations and advancements in cost-sensitive methods for tasks involving multiple labels and continuous outcome prediction. By broadening the scope of CSL to encompass these diverse problems, researchers can expand the applicability of CSL in a wider range of medical scenarios.