Abstract
This article presents a data-driven review of resampling approaches aimed at mitigating the class imbalance problem in machine learning, a widespread issue that limits classifier performance across numerous sectors. Initially, this research provides an extensive theoretical examination of the class imbalance problem, emphasizing its propensity to amplify existing data difficulty factors, including class overlap, small disjuncts, and noise, thus biasing the model towards the majority class. Acknowledging the significance of detecting and quantifying the synergistic effects between class imbalance and these data difficulty factors, this study surveys metrics formulated to quantify such phenomena in imbalanced domains. Subsequently, an exhaustive review of recent oversampling, undersampling, and hybrid sampling approaches is conducted. A major finding arising from this review is the discernible shift in resampling approaches towards enhanced adaptability. This is achieved through the identification of problematic regions and the subsequent implementation of customized resampling protocols. Concurrently, a methodological divergence is observed in both oversampling and undersampling strategies: certain oversampling methods target regions of higher classification complexity, which are crucial for effective model training, while others focus on areas of lower classification complexity to safely oversample the minority class. In contrast, undersampling approaches either predominantly remove majority samples from redundant regions or focus on class boundaries to reduce class overlap. However, despite this increased adaptability, no resampling method consistently demonstrated superior performance across all documented experiments. Consequently, we explore a promising strategy, namely the adoption of recommendation systems for resampling approaches. Lastly, the primary research challenges within this topic are discussed.
Introduction and motivation
The utilization of data and machine learning algorithms has become increasingly prevalent in contemporary society, playing a vital role in the extraction of insights and predictions from large and complex datasets. However, a common challenge encountered in the field of machine learning is the phenomenon of class imbalance, defined as a disparity in the number of samples pertaining to different classes. Given the prevalence of the error rate as the primary metric guiding classifier optimization, the presence of class imbalance can inadvertently favour the majority class, thereby diminishing the effectiveness of classification algorithms [1]. This predicament manifests across numerous sectors including, but not limited to, chemical and biochemical engineering, financial management, information technology, security, business, agriculture, and emergency management. Due to its widespread impact, class imbalance is named among the Top 10 problems in data mining [2]. Acknowledging the severity of this obstacle to model fairness and utility, a multitude of data-driven and algorithmic solutions have been proposed throughout the years to attenuate the impacts of class imbalance [1, 3, 4]. Despite the diversity of available solutions, data-driven approaches seem to be the preferred strategies, mainly due to their ease of use and independence from specific learning algorithms or classification contexts. In this scenario, the success of resampling methods seems heavily contingent on their capacity to adaptively discern areas where resampling can either assist or hinder classifier performance, underlining the necessity to understand the specific characteristics of the data being analysed, such as the presence of class overlap or small disjuncts, which are commonly referred to as data difficulty factors [5].
In this research, we aim to present a comprehensive review of recent resampling approaches developed to handle class imbalance through a data-centric lens. The paper’s contributions are manifold, commencing with a detailed overview of the interplay between class imbalance and data difficulty factors in dictating classification complexity, alongside a survey of metrics that enable a quantitative evaluation of these interactions within imbalanced domains. Moreover, we provide a highly detailed analysis of the techniques employed by recently proposed resampling approaches, paired with significant reported results. This analysis is also extended to the application of resampling in Deep Learning (DL) settings, an increasingly relevant topic in machine learning research. Complementarily, we engage in a discussion on the recent trends and limitations in resampling approaches and how these increasingly hinge on classification complexity assessment to adaptively tailor their behaviour to each unique classification problem. Lastly, we explore the horizon of future research, namely the potential of complexity metrics to aid in alleviating certain shortcomings of current resampling approaches, particularly in the context of recommendation systems.
Most reviews in this field fail to delve into the theoretical foundation concerning the essential role of data difficulty factors in amplifying the complexity of imbalanced classification tasks [6‐9]. Without a solid understanding of this interaction, it becomes challenging to fully comprehend this issue or explore effective avenues for improving existing resampling techniques. To address this gap, the present work examines the challenges involved in identifying and quantifying these factors, a topic also often overlooked in related literature. This analysis is crucial, as contemporary resampling methods depend on identifying problematic regions within the data to adopt algorithmic behaviour tailored to each classification problem, and the insights offered in this study are pivotal for guiding the development of more robust and efficient methodologies. Additionally, the detailed examination of resampling techniques provides readers with practical insights into how these algorithms are implemented, in contrast to the higher-level discussions found in other reviews, facilitating a clearer comparison of methods and the identification of gaps in current approaches [10]. Finally, it is noteworthy that, to the best of our knowledge, this review is among the few that explore resampling recommendation systems, highlighting its significance and potential impact on advancing the field of imbalanced learning.
This article is methodically structured as follows: In “Class imbalance: formal definition”, we introduce a formal exposition of the class imbalance problem. In “Classification complexity in imbalanced domains”, we delve into the data difficulty factors that interact synergistically with class imbalance, amplifying the complexity of the classification task. Additionally, we provide a survey on existing complexity metrics which can quantify the presence of these data irregularities in imbalanced domains. Finally, we present the most widely adopted performance metrics in imbalanced classification problems. “Data-driven approaches to handle class imbalance” undertakes a comprehensive review and analysis of recent literature focusing on oversampling, undersampling, and hybrid sampling techniques. Complementarily, the concurrent application of resampling and Deep Learning models is discussed at the end of “Data-driven approaches to handle class imbalance”. “Resampling approaches and complexity assessment” centres on the discussion of how the assessment of classification complexity underpins the increasing adaptability of resampling strategies and the potential of resampling recommendation systems in addressing current limitations of resampling approaches. The closing section, “Conclusion”, encapsulates the key takeaways from our study and identifies potential trajectories for future research.
Class imbalance: formal definition
Imbalanced domain problems reside within a specific typology of problems belonging to the broad panorama of predictive tasks, whether these are regression or classification. Formally, the classification process involves obtaining an approximation, \(h\left({X}_{1},{X}_{2},\dots , {X}_{m}\right),\) of some undetermined function, \(Y=f({X}_{1},{X}_{2},\dots , {X}_{m})\), aimed at mapping an m-dimensional input vector to the target values. In supervised classification contexts, \(h({X}_{1},{X}_{2},\dots , {X}_{m})\) is estimated with respect to the training data, \(T=\{{\left({x}_{i},{y}_{i}\right)}_{i=1}^{n}\}\). Given this theoretical framework, the imbalanced domain problem can be described according to the following assertions [1]:
1.
The context of the classification task imposes an asymmetric valuation of the predictive performance of \(h({X}_{1},{X}_{2},\dots , {X}_{m})\) relative to the target variable domain [1]. Specifically, accurate classification of the minority class is typically prioritized.
2.
One or more classes within the scope of the problem hold an unrepresentative volume of data compared to the remaining dataset, culminating in skewed learning towards the majority class(es) [1, 11, 12]. As an illustration, consider a binary classification scenario where the majority class, typically designated as the negative class in the literature, corresponds to 99% of the dataset, while the remaining instances correspond to the minority, or positive, class [11]. This distribution of classes is reflected, in most cases, in the classifier neglecting the minority class, by virtue of a learning process which seeks to maximize the overall accuracy of the model [11]. In other words, prioritizing the ability of the classifier to correctly classify the minority class will culminate in a suboptimal training loss when compared to prioritizing the majority class, generating a classifier incapable of distinguishing both classes [13]. The existence of an imbalanced dataset can be attributed to two main reasons: errors or biases during the data collection process or, alternatively, the intrinsic nature of the domain from which the data originate [11].
This definition of imbalance problems also justifies an in-depth discussion about the concept of data representativeness, highlighted in the second point, particularly when it comes to its ability to differentiate the target variable domain. Previous studies show that the asymmetry of the data distribution by itself does not hinder the learning process of decision algorithms, provided that the classes are easily separable (e.g., a linearly separable problem with large margins) [6, 7]. Thus, the presence of class imbalance acts to exacerbate existing data difficulty factors, such as class overlap or class noise, which in turn elevates the complexity of the classification problem and impairs the model’s capacity for generalization and predictive performance [7, 9]. Note that, in this article, the terms classification complexity or complexity of the classification task will be used to describe the level of difficulty associated with a classification task, which is determined by the presence, interaction, and severity of various data difficulty factors (alternatively termed data irregularities), such as class imbalance, class overlap, class noise, complex class structures, and dimensionality, among others, each of which can also be referred to individually as a complexity in the data. It is also important to emphasize that classifier performance, evaluated through metrics designed for imbalanced domains such as the F1-score or the geometric mean of class-wise accuracies, offers a global estimation of classification complexity; however, it does not provide detailed insights into the underlying data issues contributing to such complexity. Furthermore, the term complexity assessment is utilized to refer to the estimation of the presence and severity of these data difficulty factors via complexity metrics, which will be further discussed in the following sections.
Following the same line of reasoning regarding the interplay between class imbalance and the presence of data difficulty factors, it becomes clear that evaluating the classification complexity of an imbalanced classification problem solely through class cardinality is insufficient, as is the case of the Imbalance Ratio (IR) [4]. This metric is widely used in the literature and is computed as the quotient between the number of samples in the majority class and the number of samples in the minority class [13]. Barella et al. measured a low correlation between the cardinality of classes and the performance of classifiers in imbalanced scenarios, by computing the g-mean in 102 datasets, corroborating the presented hypothesis [14]. Complementarily, Napierala et al. verified that certain benchmark datasets with a high imbalance ratio (50:1) were easier to learn compared to datasets with a less pronounced imbalance ratio (4:1) [15]. In view of the relevance of considering the presence of data difficulty factors in the evaluation of imbalanced domain problems, we introduce a discussion of the main phenomena that contribute to the complexity of the classification task, with special emphasis on the context of imbalanced domains. We stress the fact that, since resampling methods act on altering the distribution of the data, being able to identify and quantify these data irregularities can help to adaptively remove or generate samples which boost classifier performance.
Classification complexity in imbalanced domains
Class imbalance and data difficulty factors
The review of the literature reveals four main determinants of classification complexity, all of which exhibit some form of interdependence: the existence of small disjuncts, noisy data, data-shift, and class overlap [4, 5, 16]. Small disjuncts correspond to the existence of different sub-concepts or clusters of the same class, usually with a small number of instances. Noisy data refers to inconsistencies that negatively affect the data, being further subdivided into feature noise and class noise, the latter having the greater impact on the predictive ability of the classifier [4]. As for data-shift, it occurs when the assumption that the distributions of the training data and the test data are similar is violated, limiting the generalization capabilities of the learned models [4]. Finally, class overlap is the phenomenon most thoroughly described in the literature, corresponding to the existence of regions populated by samples of different classes [16], which impairs the clear definition of a decision boundary capable of distinguishing between classes without inducing overfitting. Although this phenomenon is conceptually intuitive, its quantification remains an open question within the scientific community, since there are no standardized guidelines for this matter [5, 16]. This difficulty arises because class overlap is composed of several different forms of complexity [16], making it difficult to define a metric capable of encapsulating and describing its nature as a whole. In an attempt to illustrate this, we present three distinct imbalanced scenarios in Fig. 1, each showcasing a unique form of class overlap and, subsequently, a different classification complexity. Within Fig. 1a, the overlapped region exhibits an imbalance ratio akin to the global imbalance of the classification problem, whereas Fig. 1b presents an overlapped region with a minority sample density four times greater than the remainder of the class. Meanwhile, the overlapped region in Fig. 1c also presents a greater sample density, albeit scattered into numerous subclusters. These scenarios underscore how diverse data irregularities, such as the local imbalance in Fig. 1a, b, or the presence of subclusters in Fig. 1c, as well as other data difficulty factors like highly non-linear boundaries or high data dimensionality, give rise to various types of class overlap, each with unique impacts on classifier performance [5]. Hence, we provide a brief summary describing how the various manifestations of class overlap affect classifier performance within imbalanced domains. That being said, for an in-depth discourse on this topic, we direct the reader to [5].
Fig. 1
Schematic representation of how different data characteristics, namely local imbalance in b and class decomposition in c, culminate in dissimilar types of class overlap. In all figures, the ratio between majority and minority class is approximately 16:1. As far as the imbalance in the overlap region is concerned, in b and c, the ratio is approximately 4:1. In a, the imbalance ratio is identical in all regions of the feature space. Lastly, we highlight that the cumulative overlapped region in all subplots is identical and delimited by a black rectangle
The synergistic effect between class imbalance and class overlap depends on the local characteristics of the data in the overlap region (local imbalance and data typology), general characteristics of the domain (class decomposition, dimensionality, decision boundary complexity) and, finally, the nature of the classifier at hand [5]. As for the first topic, the literature has shown that, if the overlap region presents a local distribution of classes similar to the global imbalance of the data (Fig. 1a), classifiers with a global nature (e.g., Linear Discriminant Analysis) will be much more biased towards the majority class. Conversely, classifiers with a more local nature (e.g., k-NN) exhibit superior ability in this experimental setting [5]. On the other hand, data typology, which refers to the categorization of samples based on their neighbourhood, has been shown to be a major factor in increasing classification complexity. The most commonly adopted protocol for data typology analysis, formulated by Napierala et al., requires examination of the five nearest neighbours of each instance [9]. Within this schema, instances with at most one nearest neighbour from a differing class are designated as "safe". Instances with two or three opposing-class neighbours fall under the "borderline" category. Samples with only one same-class neighbour are deemed "rare" and those encircled purely by contrasting class samples are identified as "outliers" [15]. In this context, the existence of a significant number of borderline samples, positioned at the boundary between classes, limits the performance of classifiers. Shifting the focus to the general characteristics of the data, domains with a more complex structure, evidenced by highly non-linear decision boundaries or by the existence of multiple sub-concepts, culminate in performance decreases, especially in tree- and rule-based classifiers [5]. Lastly, we highlight that data dimensionality was shown to have a relatively small impact on classification performance, although very few studies have been conducted in this regard [5].
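To make this typology concrete, the following is a minimal sketch, assuming scikit-learn and NumPy, of the 5-NN categorization described above; in the original protocol the typology is applied to minority-class samples, but the function below labels every instance for simplicity.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def napierala_typology(X, y, k=5):
    """Assign each sample a Napierala-style label (safe / borderline / rare / outlier)
    according to how many of its k nearest neighbours belong to a different class."""
    X, y = np.asarray(X), np.asarray(y)
    # k + 1 neighbours, because each point is returned as its own nearest neighbour
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    labels = []
    for i, neigh in enumerate(idx[:, 1:]):           # drop the point itself
        n_opposite = int(np.sum(y[neigh] != y[i]))
        if n_opposite <= 1:
            labels.append("safe")                    # at most one opposing neighbour
        elif n_opposite <= 3:
            labels.append("borderline")              # two or three opposing neighbours
        elif n_opposite == k - 1:
            labels.append("rare")                    # only one same-class neighbour
        else:
            labels.append("outlier")                 # surrounded by the other class
    return np.array(labels)
```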
Complexity metrics: overview and limitations
Complexity metrics are measures designed to quantify intrinsic characteristics of the data which imply some difficulty for the classification problem, for instance its typology or local imbalance [5]. The most widely reported approach in the complexity-evaluation literature, designated Maximum Fisher's discriminant ratio (\(F1\)) [5], intends to assess the projection of the data that maximises separability between classes, as described by the following formulation:
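(For reference, the ratio in question is commonly written, following Ho and Basu, as the maximum over features of a between-class to within-class variance ratio; note that some implementations report the normalized complement \(1/(1+F1)\), bounded in \([0, 1]\), so that higher values indicate greater complexity.)

\(F1={\max }_{j=1,\dots ,m}\ \frac{{\left({\mu }_{{f}_{j}}^{c0}-{\mu }_{{f}_{j}}^{c1}\right)}^{2}}{{\left({\sigma }_{{f}_{j}}^{c0}\right)}^{2}+{\left({\sigma }_{{f}_{j}}^{c1}\right)}^{2}}\)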
with \({f}_{j}\) referring to each feature, where \(j=1,\dots , m\), \({\mu }_{{f}_{j}}^{c0}\) and \({\sigma }_{{f}_{j}}^{c0}\) indicating the mean and standard deviation of \({f}_{j}\) in the majority class, and \({\mu }_{{f}_{j}}^{c1}\) and \({\sigma }_{{f}_{j}}^{c1}\) pertaining to the mean and standard deviation of \({f}_{j}\) in the minority class. Although this metric provides a more global analysis of the overall distribution of the data, it has limited ability to assess the local properties of the data, which are extremely relevant when analysing the boundaries between classes, as they usually correspond to the most problematic regions for the classification process [5]. In this case, instance-level overlap measures have been proposed based on the premise that not all examples in the data present the same degree of classification complexity (a concept typically referred to as instance hardness), thus triggering the need to introduce more granularity in the analysis of the complexity of the classification task [5]. In this context, one resorts, for instance, to the error of a 1-NN classifier on synthetic data generated through interpolation of the original dataset, which is commonly referred to as the N4 complexity metric. The difficulty and heterogeneity involved in quantifying this phenomenon is immediately evident. To further this discussion, the F1 and N4 metrics were applied to the artificial datasets presented in Fig. 1a, b. Given that both datasets possess similar means and standard deviations, differing only in the overlap region, the calculated F1 values were nearly indistinguishable, recording 0.816 and 0.828 for Fig. 1a, b, respectively. However, when the N4 metric was employed, a significant divergence was noted, with the complexity of the Fig. 1b dataset (0.079) being more than double that observed for the Fig. 1a dataset (0.033). This increase in complexity is explicable by the greater incidence of misclassification of minority class samples within the overlap region.
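As an illustration of how such an instance-level measure can be estimated in practice, the following is a minimal sketch, assuming scikit-learn and NumPy, of a per-class, N4-style computation: synthetic points are created by interpolating random same-class pairs and then scored with a 1-NN classifier trained on the original data (the exact sampling protocol of the adapted metric in [14] may differ).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def n4_per_class(X, y, target_class, n_synthetic=500, seed=0):
    """Rough per-class, N4-style score: 1-NN error on synthetic points obtained by
    linear interpolation between random pairs of samples of `target_class`."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    rng = np.random.default_rng(seed)
    Xc = X[y == target_class]
    # build synthetic test points between randomly chosen same-class pairs
    i = rng.integers(0, len(Xc), n_synthetic)
    j = rng.integers(0, len(Xc), n_synthetic)
    alpha = rng.random((n_synthetic, 1))
    X_syn = Xc[i] + alpha * (Xc[j] - Xc[i])
    # 1-NN trained on the original data, evaluated only on the synthetic points
    clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
    return float(np.mean(clf.predict(X_syn) != target_class))
```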
These numerical results reveal how, depending on the metric utilized and on the nature of the complexity evaluated, completely different evaluations of the complexity of the classification task are obtained, since the presented scenarios exhibit almost identical \(F1\) values and, simultaneously, quite divergent 1-NN error values. This problem stands out as one of the main focuses of research in this area, where new indicators of class overlap, which consider the different types of complexity, should be developed [5], thereby providing a more thorough description of the characteristics of the data. In the scope of resampling approaches, the lack of a methodology capable of holistically evaluating classification complexity and identifying regions of interest can bottleneck the capabilities of these approaches, as will be discussed in subsequent sections. Additionally, the multitude of class overlap representations has hindered the definition of a standardised taxonomy that makes it feasible to categorise the different developed metrics, thus enabling a systematic analysis and comparison of the approaches reported in the literature [5, 16]. The most frequently employed metric categorization was developed by Ho and Basu, assuming the division of complexity metrics into: \((i)\) measures of overlap between individual features, \((ii)\) measures of separability between classes and \((iii)\) measures of geometry, topology, and density of manifolds [8]. Similarly, new taxonomies have been proposed over the years to accompany new developments in the area, among which stands out the work recently developed by M. Santos et al., establishing the classification of overlap-based complexity metrics into four distinct categories [5]:
1.
Feature Overlap—metrics that characterize the class overlap of individual features through the properties of feature distributions, usually closely related to the concept of class separability. Examples: \(F1, F2, F3, F4\)
2.
Structure Overlap—metrics associated with the internal structure of classes (data morphology). Examples: \(N1, T1, ONB, N2\).
3.
Instance-level Overlap—metrics aimed at pinpointing problematic regions in the data based on a more localized analysis of the domain's characteristics, typically conducted through nearest-neighbour classifiers. Examples: \(N3, N4, wCM, {R}_{value}.\)
4.
Multiresolution Overlap—metrics that combine local and structural information of features, usually through recursive change of hyperparameters, neighbourhoods, etc., allowing for analysis at different resolutions. Examples: \(C1, C2\).
To conclude the discussion on class overlap, we emphasise the fact that most of the presented metrics do not acknowledge the presence of class imbalance, often being biased towards the majority class. Returning to the example in Fig. 1 and recalling that the F1 indicator depends exclusively on the mean and standard deviation of the features, it is agnostic to the proportion between the minority class and the majority class. Similarly, applying the 1-NN metric to a highly imbalanced dataset and hypothesizing a worst-case scenario of classification complexity in which every minority sample is misclassified by the 1-NN, the resulting total error will nonetheless be small (e.g., at a 99:1 class ratio, misclassifying all minority samples still yields an overall error of only 1%). Therefore, in light of the fact that class imbalance significantly contributes to the complexity of the classification task, the majority of conventional complexity metrics are ill-equipped to provide insights in such domains.
Complexity metrics in imbalanced domains: a survey
Considering the relevance of jointly evaluating imbalance and data irregularities in the scope of resampling approaches, coupled with the limitations of traditional complexity metrics, several scientific works have suggested modifications to the established metrics or have introduced entirely new metrics, which are succinctly compiled in Table 1. These adaptations have usually proven more accurate than their traditional counterparts, yielding a higher correlation with classification performance, which is usually evaluated by metrics sensitive to class imbalance, such as G-mean or F1-score [14].
Table 1
Main complexity metrics reported in the literature in imbalanced classification contexts
Weighted complexity measure (\(wCM\))
\({W}_{ij}=\left\{\begin{array}{ll}\frac{d\left({x}_{i},N{N}_{k}({x}_{i})\right)-d\left({x}_{i},N{N}_{j}({x}_{i})\right)}{d\left({x}_{i},N{N}_{k}({x}_{i})\right)-d\left({x}_{i},N{N}_{1}({x}_{i})\right)} & \text{if } d\left({x}_{i},N{N}_{k}({x}_{i})\right)\ne d\left({x}_{i},N{N}_{1}({x}_{i})\right) \\ 1 & \text{if } d\left({x}_{i},N{N}_{k}({x}_{i})\right)=d\left({x}_{i},N{N}_{1}({x}_{i})\right)\end{array}\right.\)
Indicator based on a neighbourhood analysis (k-NN) of minority samples. Firstly, a relative weight, \({W}_{ij}\), is assigned to each neighbouring sample \(j\) of sample \(i\), based on their Euclidean distance, \(d\left({x}_{i},N{N}_{j}({x}_{i})\right)\). Additionally, this distance is normalised according to the distance to the closest nearest neighbour, \(N{N}_{1}\left({x}_{i}\right),\) and the farthest nearest neighbour, \(N{N}_{k}\left({x}_{i}\right),\) in order to adapt the obtained values to each part of the domain. Lastly, \(wCM\) corresponds to the percentage of minority samples whose neighbourhoods contain mostly highly weighted majority samples
Adapted Volume of overlapping region (\(F2\))
Evaluation of the overlap region through the feature-wise maximum and minimum of each class (\(minmax\left({f}_{i}\right)\) and \(maxmin\left({f}_{i}\right)\)). The adaptation of this metric for imbalanced scenarios assumes a normalisation of the values in accordance with the range inherent to each class, thus mitigating the propensity of the original indicator to underestimate complexity
Adapted Maximum individual feature efficiency (\(F3\))
Computation of the percentage of instances within the feature-wise overlap region, this being again determined based on a simultaneous analysis of the maxima and minima of the different classes. Given that this is a percentage analysis, it is naturally biased towards the majority class, and its adaptation entails the individual computation of this metric for each class
Adapted Collective feature efficiency (\(F4\))
Iterative selection of the most discriminative features by computing the value of \(F3\), followed by the progressive elimination of the instances that can be discriminated based on that same feature. \({T}_{i}\) signifies the resulting dataset at each sequential iteration. Finally, the adapted value of F4 corresponds to the number of instances that are not discriminated through the different features, with this value normalized by the total number of samples of the same class, thus adapting the metric to imbalanced scenarios
Adapted Fraction of points on the class boundary \((N1)\) [14]
Representation of the data according to a Minimum Spanning Tree (MST) that connects all the points of the domain as a function of their pairwise distance, \(({x}_{i}^{{c}_{i}},{x}_{j})\). At a later stage, the complexity is estimated on the basis of the quotient between the number of points connected to a distinct class sample and the total number of existing points. In order to make this method robust to class imbalance, we resort to the same strategy previously mentioned concerning the calculation of complexity for each class individually
Adapted Ratio of the intra/inter class NN distance \((N2)\) [14]
Computation, for each instance of the feature space, of the ratio between the distance of the closest sample of the same class and the closest sample of the opposite class. The decomposition of this factor for each class allows this metric to be adapted to imbalanced datasets
Adapted Leave-one-out error rate of the NN classifier \((N3)\) [14]
Metric based on the neighbourhood analysis in which the number of samples not matching the same class as the instance under study is evaluated. This value is then normalised against the total number of samples belonging to each class
Adapted Non-linearity of a 1-NN classifier \((N4)\) [14]
At first, this metric assumes the creation of new samples, \({x}_{i}^{{c}_{1}{\prime}}\), by means of random interpolation. The set of new synthetic samples is represented as \({l}_{{c}_{1}}\). Afterwards, N4 assumes the error value of an NN classifier, trained on the true samples of the dataset, and evaluated on those artificially generated samples. Note that the adaptation of this metric entails the individual evaluation of each class, i.e., the classification error computed separately for each class of interest
Adapted Fraction of maximum covering spheres \((T1)\) [14]
This indicator implies the creation of a hypersphere for each point of a given class, with the maximum possible radius before it reaches a sample of another class. Then, the hyperspheres contained in other hyperspheres of larger diameter are eliminated. In this manner, T1 corresponds to the ratio between the obtained number of spheres and the number of samples of a given class
Adapted Minimized sum of error distance of a linear classifier \((L1)\) [14]
Application of a linear classifier that allows the definition of a decision boundary from which the distance of incorrectly classified samples, \({\epsilon }_{i}^{{c}_{1}}\), from a particular class is calculated
Adapted Training error of a linear classifier (L2) [14]
Similar methodology to \(N4\), where the generation of a test set by random interpolation is used. However, this metric evaluates the error of a given class by a linear classifier. Again, this metric assumes that less linearly separable classes are more difficult to learn
Bayes imbalance impact index (\({BI}^{3}\))
The \({BI}^{3}\) corresponds to the average of \(I{BI}^{3}\) values computed for all samples in the minority class. In turn, \(I{BI}^{3}\) reflects the difference between the normalized posterior probability of a Bayes classifier in an imbalanced scenario, \(p\left(+|x,f\right)\), i.e., where the prior probabilities depict the asymmetry of samples from each class, and the normalized posterior probability in a balanced scenario, \(p\left(+|x,{f}{\prime}\right),\) where identical prior probabilities are imposed on both classes. Note that this difference describes the impact of class imbalance on the decision process of the classifier
Overlap number of balls (\(ONB\))
Metric based on a principle similar to \(T1\), since it intends to measure the number of hyperspheres, \({b}_{ci}\), which contain exclusively instances of the same class, that are necessary to cover the entire data space. That said, this metric resorts to a different algorithm for the definition of the hyperspheres, called Pure Class Cover Catch Digraph. Finally, the authors of this method proposed the evaluation of the distances, from which the hyperspheres are defined, by means of the Euclidean distance (\({ONB}_{avg}^{euc}\)) and the Manhattan distance (\({ONB}_{avg}^{man}\))
Augmented \(R\)-value
with \(R\left({c}_{i},{c}_{j}\right)=\frac{r\left({c}_{i},{c}_{j}\right)+r\left({c}_{j},{c}_{i}\right)}{{n}_{{c}_{i}}+{n}_{{c}_{j}}}\), and \(r\left({c}_{i},{c}_{j}\right)=\sum_{m=1}^{{n}_{{c}_{i}}}I\left(\sum_{j=1}^{k}I\left(N{N}_{j}\left({x}_{m}^{{c}_{i}}\right)\ne {c}_{i}\right)-\theta >0\right)\)
This indicator is grounded in a neighbourhood analysis to identify overlapping regions between classes. Specifically, for each instance, the number of samples belonging to the opposing class within its k-nearest neighbours is assessed. If this number exceeds a predefined threshold, \(\theta \), the sample is considered to be in an overlapped region. The R-value corresponds to the percentage of overlapping instances among all samples. To adapt this metric to imbalanced contexts, the R-value for the minority class is assigned a weight proportional to the imbalance ratio, thus yielding the augmented R-value
For each metric, its formulation and a brief description of the associated concept is provided. To facilitate cross-referencing, the mathematical notation has been retained as closely as possible to that used in the original articles. At the same time, certain conventions were adopted to align with formalizations of previously established concepts: \(T\) refers to the training data, \(h\) denotes the learning model, \({f}_{j}^{{c}_{i}}\) corresponds to the values of feature \(j\) in class \({c}_{i}\); \({n}_{{c}_{j}}\) refers to the number of samples of class \({c}_{j}\); \({x}_{ji}^{{c}_{j}}\) indicates the value of feature \(j\) of instance \(i\) from class \({c}_{j}\), where if only a single index \(i\) appears in the subscript it represents the \(i\)-th sample across all its features; \(I\) is an indicator function that returns 1 if the argument is true and 0 otherwise; and \(N{N}_{j}({x}_{i})\) indicates the \(j\)-th nearest neighbour of \({x}_{i}\). Any additional symbols introduced in the mathematical formulations are explicitly defined within the respective concept descriptions to ensure clarity
By analysing Table 1, it becomes evident that the most commonly employed strategy to circumvent the metrics' bias towards the majority class is to compute the metrics individually for the different classes. This is understandable, since most of the original metrics normalize the data according to the global number of samples, limiting the impact of the minority class. It is noteworthy that many resampling approaches resort to metrics similar to those depicted in Table 1 to adaptively tailor their behaviour to each classification problem, as will be further discussed in “Resampling approaches and complexity assessment”.
Table 2
Schematic representation of a confusion matrix
Actual positive class: True positive (TP) if predicted as positive; False negative (FN) if predicted as negative
Actual negative class: False positive (FP) if predicted as positive; True negative (TN) if predicted as negative
Performance metrics in imbalanced domains
This section aims to outline the primary approaches for assessing the performance of learning models in imbalanced problems. It should be emphasized that these performance metrics differ from the complexity metrics previously addressed, as their goal is to provide a holistic perspective on the model's efficacy based on predefined user preferences. Although these can also be interpreted as a global estimate of classification complexity, they fail to offer detailed insights into the data irregularities underlying such classification complexity. That being said, performance metrics are occasionally applied in complexity assessment, such as in the L1 or N3 metric [14].
Traditional performance metrics, such as accuracy and its complement, the error rate, encounter a significant limitation in imbalanced datasets, as the metric’s value is largely driven by the most frequent class [1, 6]. This leads to biased evaluations favouring the majority class in imbalanced domains. Since, in such classification tasks, correct identification of the minority class is often of higher importance, these metrics are inadequate for assessing model performance [1, 20]. Note that selecting a performance metric that aligns with the specific requirements of the classification problem is crucial, as it profoundly influences model selection, parameter tuning, and other aspects of the modelling pipeline [1].
To introduce the metrics commonly employed in imbalanced domains, we begin by considering a binary classification problem with one positive and one negative class. The classifier’s performance is summarized using a confusion matrix, depicted in Table 2. This matrix categorizes the samples as correctly classified—True Positives (TP) for the positive class and True Negatives (TN) for the negative class—or misclassified—False Positives (FP) and False Negatives (FN), depending on the true class [21]. Based on this foundation, we proceed to present a series of metrics tailored for imbalanced domains, which can be applied according to user-defined preferences on desired model behaviour.
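(For reference, the measures discussed in the remainder of this section are conventionally defined from these confusion-matrix entries as follows.)

\(\text{recall (TPR)}=\frac{TP}{TP+FN}, \quad \text{specificity (TNR)}=\frac{TN}{TN+FP}, \quad \text{precision}=\frac{TP}{TP+FP}, \quad \text{FPR}=\frac{FP}{FP+TN}\)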
The preceding formulations demonstrate that different metrics evaluate distinct aspects of a classifier’s performance. For instance, recall quantifies the proportion of positive samples correctly identified, while specificity measures the proportion of negative samples correctly classified [1]. In many experimental contexts, improving recall often leads to a decline in specificity, underscoring the importance of simultaneously tracking both metrics to strike a balance that aligns with the desired behaviour of the model [1, 20]. A similar interplay exists between other metrics, such as the true positive rate and true negative rate or precision and recall. To facilitate the analysis of such trade-offs, integrated metrics like the Fβ score and the geometric mean are adopted, which are formalized below.
The Fβ score is a weighted harmonic mean of precision and recall, with the parameter β dictating the relative importance of these two measures. Values of β greater than 1 emphasize recall over precision, thereby imposing a higher penalty on false negatives than on false positives. In contrast, β values below 1 place greater weight on precision, reducing the impact of false negatives [21]. A commonly used configuration is \(\beta = 1\), known as the F1-score, which assigns equal weight to precision and recall. The G-mean, on the other hand, is calculated as the geometric mean of the accuracies of both classes, offering a balanced assessment of model performance [7]. Additionally, it is worth noting that other similar metrics have been proposed in the literature, such as the dominance index, balanced accuracy, and mean class-weighted accuracy [1].
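(In the notation introduced above, these two integrated measures take their usual forms.)

\({F}_{\beta }=\left(1+{\beta }^{2}\right)\frac{\text{precision}\cdot \text{recall}}{{\beta }^{2}\cdot \text{precision}+\text{recall}}, \qquad G\text{-mean}=\sqrt{\text{recall}\cdot \text{specificity}}\)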
Another critical aspect in this context is the extension of these evaluation metrics to multiclass scenarios. This is commonly achieved by calculating metrics independently for each class and subsequently aggregating these values using one of two approaches: micro-averaging or macro-averaging. Micro-averaging applies weights proportional to the frequency of each class, whereas macro-averaging gives equal importance to all classes, making it particularly advantageous in imbalanced classification settings [1]. Readers seeking a detailed mathematical formalization of metric computation in these contexts are encouraged to consult [1].
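As a brief illustration of the two averaging schemes, the following minimal sketch (assuming scikit-learn, with a hypothetical toy label set) shows how both are exposed through the average argument of the standard metric functions.

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 2]      # toy imbalanced three-class labels
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 2]

# micro-averaging pools TP/FP/FN over all classes, so the frequent class dominates
print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
# macro-averaging takes the unweighted mean of per-class F1 scores,
# giving rare classes the same influence as the majority class
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
```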
Beyond the scalar metrics previously examined, graph-based metrics are essential for assessing classifier performance, with the Receiver Operating Characteristic (ROC) curve being a preeminent example [1]. These metrics are invaluable when the problem's decision-making preferences are not predefined, as they comprehensively report performance across the full spectrum of operating conditions [1]. The ROC curve quantifies performance by evaluating classifier outcomes at different decision thresholds, which determine the probability value at which samples are assigned to the positive class (the default threshold in most classifiers is typically set at 0.5) [20, 21]. Each point on the curve represents a true positive rate and a false positive rate corresponding to a particular threshold. A significant limitation of the ROC curve lies in its inability to facilitate straightforward comparisons between classifiers. To overcome this drawback, the Area Under the ROC Curve (AUC-ROC) is widely adopted, providing a unified measure that encapsulates overall classifier performance across various operating conditions, simplifying comparative analysis [1]. Additionally, other graph-based metrics, including Precision-Recall curves, Prob AUC, Scored AUC, and Brier curves, have emerged to address certain limitations inherent to the ROC-based methodology.
Data-driven approaches to handle class imbalance
As discussed in previous sections, imbalanced data poses a significant challenge to the predictive performance of classifiers across various domains. To address this issue, several methodologies have been developed and can be broadly classified into four groups: (1) data-level approaches, (2) algorithmic-level approaches, (3) cost-sensitive approaches, and (4) ensemble learning approaches [3, 4, 22]. Data-level approaches involve modifying the data distribution to amplify the relative importance of the minority class during model training. This is achieved through random or informative resampling techniques, such as undersampling or oversampling [3, 4]. Such approaches are widely employed due to their versatility, allowing for their application irrespective of the specific learning model or classification context. In contrast, algorithmic-level approaches focus on adapting existing learning algorithms to mitigate the bias associated with the majority class, without altering the training set. Strategies encompass adjusting the decision threshold, training class-specific classifiers, weighting-based approaches, among others [3]. While algorithmic-level approaches offer enhanced model interpretability by incorporating user goals directly into the classification algorithms, they demand a comprehensive theoretical understanding of learning algorithms and optimization mechanisms [1, 22]. Moreover, they are constrained to algorithms explicitly adapted for this purpose [4].
On the contrary, cost-sensitive approaches address the issue of class imbalance by introducing asymmetric classification costs between classes into the modelling process. These costs, typically represented as a cost matrix, aim to prioritize the correct classification of minority instances during model training. However, an inherent challenge associated with this approach is the determination of context-dependent misclassification costs, which are very often unknown or difficult to ascertain [1, 22].
Lastly, ensemble learning approaches tailored to imbalanced domains combine predictions from multiple classifiers to create an ensemble model that outperforms individual base models. The success of ensemble learning relies on the diversity of the base classifiers, i.e., classifiers which generate distinct outputs for the same input, as identical classifiers would not yield enhanced results [1]. Techniques such as bagging, boosting, and utilizing resampling methods with different parameterizations for each base classifier are commonly employed to enhance diversity and reduce model variance. For a comprehensive review of ensemble learning techniques, we advise the reader to refer to [1].
The objective of this review is to provide an extensive analysis of the current literature focusing exclusively on data-driven approaches, which have gained significant popularity among data practitioners due to their simplicity, as outlined above. Therefore, we will not delve into the integration of resampling methods with algorithmic-driven approaches within this study. Moreover, the implementation of Deep Learning architectures for synthetic data generation, such as GANs, will not be addressed in this review. Instead, our attention will be solely directed towards the examination of recently reported approaches in the literature pertaining to oversampling, undersampling, and hybrid sampling techniques. These approaches will be thoroughly discussed and evaluated in subsequent sections. To complement this analysis, we also review existing works which apply the aforementioned resampling methods in conjunction with Deep Learning models. As far as the literature search methodology is concerned, the electronic databases Scopus, ScienceDirect, Google Scholar, and IEEE Xplore were utilized, using the terms "imbalanced data" along with "oversampling," "undersampling," and "hybrid sampling," according to the type of resampling method being analysed. The search was further limited to articles published after 2018. Due to the large volume of publications, especially concerning oversampling and undersampling, articles were ranked and selected based on citation count, with a minimum threshold of 10 citations. This approach ensures that the selected works are both recent and broadly recognized by the scientific community, reflecting their validation and influence on potential future research. Additionally, we aim to investigate the significance of complexity metrics in resampling methods and how they contribute to the adaptability of these techniques in domains with varying characteristics, thereby enhancing their robustness. Lastly, based on the conducted analysis, we will discuss potential future research directions.
Oversampling
In the context of addressing class imbalance using resampling techniques, SMOTE has gained widespread recognition as one of the most extensively utilized oversampling techniques [23]. It introduced a paradigm shift by pioneering the creation of synthetic samples, distinguishing itself from the prevailing practice of replicating existing samples, as exemplified by Random Oversampling (ROS) [24]. The shift towards synthetic sample generation stemmed from the limitations of ROS, particularly its tendency to induce model overfitting by promoting the creation of very specific decision regions through the iterative duplication of identical samples [25]. To address this challenge, SMOTE introduces a strategy wherein the generation of new samples is achieved through interpolation. This technique leverages the proximity of samples belonging to the minority class, enabling a more accurate and comprehensive modelling of the minority class. Consequently, the model's ability to generalize improves considerably [24, 25].
Formally, the SMOTE method begins by selecting the amount of oversampling to be performed, \(N\), which is typically chosen to obtain a 1:1 ratio between the majority and minority samples [24]. Afterwards, an instance of the minority class, \({x}_{i}\), is randomly selected, from which the k-nearest neighbours pertaining to the same class, \(NN({x}_{i})\), are determined, where \(k\) typically assumes the value of 5 [24]. Lastly, \(N\) of the \(k\)-nearest neighbours are selected to generate new samples, \({x}_{new}\), based on the following expression:
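(The expression below is the standard SMOTE interpolation rule, written in the notation of this section, where \(N{N}_{j}({x}_{i})\) denotes the selected nearest neighbour.)

\({x}_{new}={x}_{i}+rand(0,1)\times \left(N{N}_{j}\left({x}_{i}\right)-{x}_{i}\right)\)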
with \(rand(0,1)\) indicating a randomly selected number between 0 and 1. By repeatedly executing this process, the desired number of samples can be generated to attain a balanced dataset. Adjusting class frequencies to equality prevents the misclassification cost of the majority class, driven by its larger sample size, from dominating the classifier training process. This approach enhances the likelihood of correctly classifying minority class instances. Empirically, resampling has been shown to lead to statistically significant gains in classifier performance, including improvements in recall and F1-score [26].
However, despite its widespread adoption, several limitations arise in its application. Notably, these limitations stem from the inherent randomness associated with sample selection for oversampling, where samples are chosen with uniform probability. Consequently, regions with a higher density of minority samples have a higher probability of being further inflated and vice versa, which highlights the difficulties of this method in addressing within-class imbalance and the presence of small disjuncts. Moreover, the utilization of SMOTE has the potential to magnify the impact of noise in the data by oversampling noisy samples, ultimately leading to diminished performance or overfitting. Lastly, it is worth noting that the structural characteristics of the majority class are not considered during the generation of new samples, which may result in class overlap [25]. To mitigate such issues, several SMOTE variants have been developed and are readily accessible through popular machine learning libraries such as scikit-learn [27].
In 2005, Han et al. proposed the Borderline-SMOTE method, which is based on the premise that the samples at the boundaries between classes are the most determinant in defining an adequate decision boundary, i.e., one that maximises the distinction between the different classes [28]. The algorithm therefore begins by categorizing samples as safe, danger, or noise, depending on the proportion of majority samples within the nearest neighbours of each minority sample. Note that samples are considered to be in danger if at least half of their nearest neighbours correspond to the majority class, unless all nearest neighbours originate from the contrary class, in which case they are identified as noise. Once the danger samples are determined, which are typically arranged along the border between classes, the artificial sample generation procedure is initiated based on the SMOTE interpolation expression presented above [28]. Under the same underlying hypothesis, Nguyen et al. introduced a novel technique known as SVMSMOTE [29]. This method performs oversampling of support vectors belonging to the minority class, which are obtained through the training of an SVM on the original dataset. In SVMSMOTE, the generation of new samples encompasses two distinct strategies: extrapolation and interpolation. When the number of minority neighbours exceeds that of the majority samples, extrapolation is employed, thereby inducing an expansion of the minority class region. Conversely, in scenarios where the inverse holds true, interpolation is applied to consolidate the minority class area. Importantly, in contrast to SMOTE, the selection of samples in SVMSMOTE occurs sequentially, by evaluation of their proximity to the sample under consideration, \({x}_{i}\) [29].
In 2008, He et al. suggested the ADASYN method, which diverged from SMOTE by introducing a novel strategy for the adaptive generation of synthetic samples [30]. This strategy considers the density distribution of the minority class, which is evaluated based on the number of samples from the majority class present within the \(k\)-nearest neighbours of each sample. By incorporating this density information, regions characterized by a scarcity of minority samples are prioritized, resulting in a higher number of synthetic samples being generated in such areas through SMOTE's procedure. Consequently, ADASYN provides enhanced representation for challenging instances that are more difficult to learn, thus contributing to the effective handling of class imbalance [30].
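As a practical illustration, the variants discussed above share a common interface in the imbalanced-learn package, a scikit-learn-compatible library; the sketch below, using a purely synthetic dataset, shows the typical usage pattern.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN

# synthetic binary problem with roughly a 19:1 class ratio
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05],
                           random_state=42)

for sampler in (SMOTE(k_neighbors=5, random_state=42),
                BorderlineSMOTE(random_state=42),
                SVMSMOTE(random_state=42),
                ADASYN(random_state=42)):
    X_res, y_res = sampler.fit_resample(X, y)        # resampled training set
    print(type(sampler).__name__, Counter(y_res))
```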
It is important to note that the methods described above represent only a subset of the extensive research conducted on data-driven approaches to handle class imbalance. Numerous other algorithms, such as Safe-Level SMOTE [31], MWMOTE [32], DBSMOTE [33], and more, have been proposed in the literature. To gain a comprehensive understanding of the advancements in this area, we direct the reader to the article [25], which offers a systematic review of various SMOTE variants up until the year 2018. In the aforementioned research, the authors proposed a framework to categorize the different changes performed to the original SMOTE algorithm. In this work, we adopt a modified version of this taxonomy for oversampling techniques, as outlined in Fig. 2. The methods are first classified into interpolation-based or non-interpolation-based approaches, according to the principle each method applies for sample generation. Then, within the interpolation-based methods, six of the seven categories of SMOTE modifications proposed in [25] are adopted, excluding the "integration with undersampling" category, which is viewed as a hybrid sampling technique in this study. As a result, interpolation-based methods can fall under multiple categories simultaneously if they incorporate several modifications to the original SMOTE. Given that the majority of recent oversampling techniques are interpolation-based, we believe this revised taxonomy effectively encompasses the existing body of literature. It is important to note that this section focuses solely on interpolation-based methods, while “Non-interpolation based oversampling approaches” addresses recently introduced non-interpolation-based approaches.
Fig. 2
Modified taxonomy for oversampling methods categorization. Note that, within interpolation-based methods, the taxonomy proposed in [25] is adopted
Based on the presented taxonomy, we intend to continue the efforts reported in [25] and conduct a thorough review of the new SMOTE-based approaches reported in the literature from 2019 to date. To ensure clarity and coherence, we present our findings in the form of a table, which provides a concise summary of the taxonomy, methodology, and conclusions of the various reported works.
Discussion
In the realm of SMOTE algorithm modifications, the literature has predominantly revolved around the selection of relevant samples for oversampling, the type of interpolation employed, the implementation of data cleaning, and the adaptive generation of samples. The first aspect aims to address a primary limitation of SMOTE, whereby all minority class samples possess an identical probability of being oversampled, which often leads to an increase in noise or class overlap. Within this context, the primary challenge lies not only in identifying and eliminating outliers, but also in managing within-class imbalances, where the minority class may consist of multiple clusters of varying sizes. Consequently, consistently differentiating between small disjuncts and outliers within datasets with distinct characteristics (such as dimensionality and noise levels) can be problematic. This issue becomes especially pronounced in highly imbalanced scenarios, where the limited number of samples impedes drawing robust conclusions about the data distribution. It is worth noting that employing an excessively aggressive outlier elimination approach may result in low sensitivity, whereas the converse situation may yield high sensitivity but low specificity. Thus, achieving a balance between both metrics is of utmost importance. To tackle these challenges, researchers have predominantly adopted strategies based on distance-based analysis. Depending on the specific neighbourhood under consideration, a more localized or global analysis of the data distribution can be performed, enabling the identification of regions where oversampling is appropriate based on predefined criteria. Additionally, certain studies, such as [41], advocate a multiresolution approach, combining global and local views of the data to enhance the robustness of the analysis. Moreover, alternative approaches leverage clustering, Gaussian Mixture Models (GMM) [2], or outlier removal algorithms to select adequate instances for oversampling [60, 61, 63].
Regarding the second previously mentioned aspect, the type of interpolation, these modifications strive to produce a more diverse set of synthetic samples that closely mimic the true distribution of the minority class. This, in turn, amplifies the generalizability of models post-oversampling. To accomplish this objective, researchers employ diverse techniques, including random affine combinations of neighbouring points, generation of points within cuboidal planes, and other innovative approaches. Thirdly, in relation to the integration of data cleaning procedures, the primary objective is to mitigate the occurrence of overlap after the generation of synthetic data. As previously discussed, this factor significantly contributes to the complexity of the classification task and leads to a decline in performance. In general, most approaches documented in the literature undertake a neighbourhood analysis to evaluate the viability of the newly generated sample, whereby the presence of a neighbourhood predominantly composed of the majority class may indicate its inadequacy. Additionally, it is important to note that several studies have reported performance degradation in methods applied to datasets characterized by a notable number of outliers [23, 43]. This suggests the importance of introducing an initial sample selection or data cleaning mechanism to enhance the success of a resampling approach [52].
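To make this neighbourhood-based viability check concrete, the following minimal sketch (our own illustration, with hypothetical function and parameter names, not taken from any specific paper) discards synthetic minority samples whose nearest original neighbours are dominated by the majority class.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def filter_generated_samples(X_orig, y_orig, X_synthetic, minority_label=1,
                             k=5, max_majority_fraction=0.5):
    """Post-oversampling cleaning step: keep a synthetic minority sample only if
    at most `max_majority_fraction` of its k nearest original neighbours belong
    to the majority class (a simple proxy for avoiding overlapping regions)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_orig)
    _, neighbour_idx = nn.kneighbors(X_synthetic)
    majority_fraction = (y_orig[neighbour_idx] != minority_label).mean(axis=1)
    return X_synthetic[majority_fraction <= max_majority_fraction]
```

Reported methods differ mainly in the neighbourhood size used for this check and in whether it is applied before or after sample generation.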
Finally, as far as the adaptive generation of samples is concerned, existing literature presents approaches that seemingly operate on fundamentally opposing principles. On the one hand, we have algorithms that generate synthetic samples preferably in regions of high classification complexity, typically characterized by a neighbourhood composed mostly of majority samples. As such, the main objective is to maximize the visibility of small clusters or samples proximal to the class boundary, as opposed to regions with a higher density of minority class samples. Although this approach leads to a greater expansion of the minority class and potentially greater model generalization, it also exhibits a propensity for inducing noise. On the other hand, other approaches primarily conduct oversampling in safe regions, found within neighbourhoods predominantly occupied by the minority class. By doing so, these methods aim to minimize the likelihood of oversampling noisy instances and to mitigate class overlap. Once again, a trade-off between reinforcing the representativeness of the minority class and the potential for generating noise and class overlap becomes apparent.
Shifting the focus to an analysis of the methods as a whole, one can quickly notice that new methods are systematically compared against non-state-of-the-art approaches, such as SMOTE or ADASYN, rather than more recent approaches. Consequently, it becomes challenging to effectively assess the potential of the new approaches and how they measure up against other state-of-the-art techniques. That being said, in [68], the authors conducted a comprehensive comparison of 85 resampling methods reported in the literature evaluated on 104 imbalanced datasets. This analysis revealed that polynomial-fit-SMOTE [69], ProWSYn [70], and SMOTE-IPF [37] demonstrated the most competitive performances among the examined methods.
In polynomial-fit-SMOTE, new samples are generated based on curve-fitting methods to obtain the coefficients of polynomials used to model the minority class. This approach can be applied in different modes, depending on the chosen curve-fitting technique. For instance, one of the variants of polynomial-fit-SMOTE, known as star topology, involves obtaining \(n\) linear functions of the form \({f}_{n}(x) = ax + b\), which are derived from the midpoint of the minority class and the \(n\) samples from the minority class. Subsequently, new samples are generated along the computed linear functions, thus enhancing sample dispersion and providing increased visibility of the minority class [69]. In Proximity Weighted Synthesis (ProWSyn), an adaptive sample generation mechanism is introduced that is inversely proportional to the distance to the majority class, i.e., greater emphasis is placed on oversampling instances located on the border between classes [70]. In the case of SMOTE-IPF, a C4.5 classifier with cross-validation is employed to identify and remove misclassified samples post-oversampling. This is achieved by utilizing a voting scheme, such as majority or consensus, which minimizes the generation of overlapping instances [37]. Considering that these methods have demonstrated superior performance under identical experimental protocols, certain conclusions can be drawn based on their similarities. Firstly, it can be inferred that adopting oversampling methodologies that specifically target samples with higher classification complexity, particularly those located near the decision boundary, can offer a better compromise between sensitivity and specificity within the above-mentioned trade-off, as also suggested in [71]. Moreover, incorporating post-oversampling data cleaning techniques can further enhance the classification performance as previously suggested, by reducing noise and overlap among generated samples. Lastly, it would be beneficial for future studies to consider these oversampling techniques as the new baseline, as they largely surpass SMOTE in terms of performance. A final relevant aspect to consider is the fact that new methods jointly suggest several alterations to the original SMOTE, for instance, a novel manner of selecting the initial samples to be oversampled coupled with a different form of interpolation for sample generation. As a result, and considering that methods are evaluated as a whole, quantifying the relative importance of each proposed step and comparing it with approaches reported in the literature becomes challenging. Therefore, it is recommended that future benchmarks adopt a sequential and modular evaluation framework to comprehensively assess the proposed methodologies.
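As an illustration of the star-topology idea behind polynomial-fit-SMOTE, the sketch below (our own simplified rendering, not the reference implementation; all names are ours) draws synthetic points along the segments that join the minority-class midpoint to randomly chosen minority samples.

```python
import numpy as np

def star_topology_oversample(X_min, n_new, rng=None):
    """Generate synthetic minority samples along the line segments connecting the
    minority-class midpoint (the 'star' centre) to randomly chosen minority samples."""
    rng = np.random.default_rng(rng)
    centre = X_min.mean(axis=0)                     # midpoint of the minority class
    anchors = X_min[rng.integers(len(X_min), size=n_new)]
    t = rng.uniform(0.0, 1.0, size=(n_new, 1))      # position along each segment
    return centre + t * (anchors - centre)          # points on the linear functions

# toy usage: five synthetic points for a three-sample minority class
X_min = np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 3.0]])
print(star_topology_oversample(X_min, n_new=5, rng=0))
```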
Despite the advantages offered by resampling techniques, particularly oversampling, for balancing class distributions and enhancing the uniformity of classifier performance with respect to target variables, they are not without limitations. Specifically, they do not guarantee the removal of all biases inherent in the data. For example, the Adult dataset, discussed in [72], contains 12 categorical and continuous attributes (e.g., age, gender, occupation) and is used to predict whether an individual's annual income exceeds $50,000. High-income earners constitute the minority class. When ROS was applied, introducing 20,000 synthetic samples, only 15% of these were women, thereby perpetuating gender bias, a commonly recognized sensitive attribute. This example underscores the risk of resampling introducing or reinforcing biases when the characteristics of the generated samples are not adequately considered [72]. While imbalanced learning often focuses on achieving class distribution parity, fairness concerns related to sensitive attributes are more rigorously addressed in the field of algorithmic fairness. This discipline prioritizes ensuring that machine learning models produce equitable outcomes for protected groups (e.g., gender, ethnicity, or age), rather than merely balancing class representation [73].
Fairness-related interventions are predominantly applied during or after model training, with resampling-based approaches being comparatively rare [74]. However, notable exceptions exist, such as Fair Oversampling [73], which employs selective resampling to incorporate feature blurring, and Fair Class Oversampling [72], which integrates clustering, boundary-sample filtering, and targeted minority-class oversampling. For a detailed analysis of fairness methods and evaluation metrics, we refer readers to [74]. Ultimately, ensuring ethical integrity in model development and deployment requires addressing biases from multiple sources, including both class imbalances and feature-specific disparities, which is crucial not only for fairness but also for fostering public trust in such systems.
Non-interpolation based oversampling approaches
While the prevailing oversampling approaches predominantly rely on interpolation-based data generation techniques, there exists a subset of algorithms that deviate from this principle, which are framed within the non-interpolation based category of the adopted taxonomy. In a recent work by Xie et al., Gaussian Distribution-based Oversampling (GDO) is introduced as an alternative strategy [43]. GDO generates new samples by translating existing minority samples, referred to as anchor samples, using a vector with a randomly selected direction. The length of this vector is determined by sampling from a Gaussian distribution with a mean of zero and a standard deviation equal to the distance of the nearest neighbour to the anchor sample. In addition to the generation process, GDO incorporates a mechanism to determine the sampling probability for each minority sample, as previously seen in SMOTE-based methods, so as to mitigate the likelihood of oversampling irregular or noisy data. The sampling probability is derived from two factors: a density factor and a distance factor. The density factor is evaluated by considering the percentage of \(k\)-nearest neighbours belonging to the majority class. Similarly, the distance factor is calculated as the ratio between the sum of distances from the sample under analysis, \({x}_{i}\), to all other \(k\)-nearest neighbours within the same class, and the sum of distances from \({x}_{i}\) to all \(k\)-nearest neighbours (comprising both majority and minority class samples). Subsequently, the calculated values are normalized to ensure that their sum equals one, which dictates the oversampling frequency for each sample. Experimentally, GDO outperformed other state-of-the-art resampling methods in terms of G-mean in 16 out of 40 datasets from the KEEL repository. Furthermore, it was observed that GDO exhibited enhanced capability in handling datasets with high imbalance rates [43].
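A simplified sketch of GDO's generation step is given below; it omits the density/distance weighting used to pick anchors (anchors are drawn uniformly here), and all function and variable names are ours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def gdo_like_generate(X_min, n_new, rng=None):
    """Translate randomly chosen minority 'anchor' samples along a random unit direction,
    with the step length drawn from N(0, d), where d is the distance from the anchor to
    its nearest minority neighbour (simplified Gaussian Distribution-based Oversampling)."""
    rng = np.random.default_rng(rng)
    dists, _ = NearestNeighbors(n_neighbors=2).fit(X_min).kneighbors(X_min)
    d_nn = dists[:, 1]                                   # column 0 is the point itself
    idx = rng.integers(len(X_min), size=n_new)           # uniformly chosen anchors
    directions = rng.normal(size=(n_new, X_min.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    lengths = rng.normal(loc=0.0, scale=d_nn[idx])       # Gaussian step lengths
    return X_min[idx] + lengths[:, None] * directions
```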
In a separate study, Bellinger et al. designed an oversampling methodology aimed at classification problems with an extreme degree of imbalance (\(IR>1000\)), designated Sampling WIth the Majority class (SWIM) [40]. Recognizing that, in such scenarios, the number of minority samples is insufficient to obtain a reasonable approximation of the data's distribution, the authors suggest a new approach to generate synthetic samples that does not depend on the position or distance between samples of the minority class, but on its density with respect to the distribution of the majority class. In practical terms, this method first centres the majority and minority classes and then applies a whitening procedure that transforms the data to have an identity matrix as its covariance matrix. Subsequently, the upper, \({u}_{f}\), and lower bounds, \({l}_{f}\), are calculated for each feature, \(f\), based on the following formulation: \({u}_{f}= {\mu }_{f}+\alpha {\sigma }_{f}\) and \({l}_{f}= {\mu }_{f}-\alpha {\sigma }_{f}\), where \(\alpha \) corresponds to a parameter that defines the size of the regions where new samples can be generated. To generate new samples, a random value within the predefined limits is selected for each feature, culminating in a new sample of the minority class. Finally, the samples are transformed back to the original space. Note also that the authors suggest a non-parametric variant of SWIM based on radial basis functions with Gaussian kernels to estimate the density of the majority class. In an experimental setting [40], it was found that SWIM guaranteed the most competitive G-mean at higher imbalance ratios. In lower imbalance regimes, it had comparable performances to SMOTE-TL and SMOTE. Furthermore, an analysis based on PCA showed that SMOTE-based techniques have better performance in relation to SWIM when the minority class is in a single very compact cluster.
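The sketch below follows the SWIM description given above (centre on the majority class, whiten with its covariance, sample each feature uniformly within \({\mu }_{f}\pm \alpha {\sigma }_{f}\) computed on the whitened minority class, and map back); it is a simplified parametric variant under our own assumptions, not the authors' code.

```python
import numpy as np

def swim_like_generate(X_maj, X_min, n_new, alpha=0.5, rng=None):
    """Generate minority samples guided by the majority-class distribution (SWIM-style sketch)."""
    rng = np.random.default_rng(rng)
    mean_maj = X_maj.mean(axis=0)
    cov = np.cov(X_maj - mean_maj, rowvar=False)
    # ZCA-style whitening of the majority covariance (small ridge for numerical stability)
    vals, vecs = np.linalg.eigh(cov + 1e-6 * np.eye(cov.shape[0]))
    W, W_inv = vecs @ np.diag(vals ** -0.5) @ vecs.T, vecs @ np.diag(vals ** 0.5) @ vecs.T
    Z_min = (X_min - mean_maj) @ W                       # minority class in whitened space
    mu, sigma = Z_min.mean(axis=0), Z_min.std(axis=0)
    Z_new = rng.uniform(mu - alpha * sigma, mu + alpha * sigma,
                        size=(n_new, X_min.shape[1]))    # per-feature bounds l_f, u_f
    return Z_new @ W_inv + mean_maj                      # back to the original feature space
```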
Undersampling approaches
The evolutionary trajectory of undersampling approaches has mirrored that of oversampling approaches, since the first proposed method was Random Undersampling (RUS), which involved the random removal of instances from the majority class. By doing so, RUS assigns identical significance to both classes during the optimization process of classifiers. However, the application of RUS carries a substantial risk of information loss, thereby adversely impacting the classifier's performance [1]. Consequently, informative undersampling techniques have emerged, incorporating data characteristics to selectively eliminate samples to enhance the visibility of the minority class. Although numerous undersampling approaches have been proposed in the literature, this section focuses on the most widely used techniques that are available in open-source repositories.
In response to the intrinsic loss of information associated with random sample removal, one of the initial methods introduced was the Condensed Nearest Neighbours (CNN) technique [65]. This approach first places the minority class samples into a dedicated set, denoted as \(C\), while assigning the majority class samples to set \(S\). Subsequently, the majority samples in \(S\) are examined one by one: each is classified with a 1-NN rule using the current set \(C\) as reference and, if it is misclassified, it is permanently added to \(C\). This iterative process is repeated until no further samples can be added, and the obtained set \(C\) corresponds to the new balanced dataset. The primary objective of CNN is to derive a representative subset of the data distribution that exhibits performance on par with that obtained using the complete dataset, this concept being termed the minimal consistent subset [65]. However, it is important to note that this algorithm has some vulnerability to noisy samples, since these will often be misclassified and thus included in set \(C\). To address this limitation, Kubat et al. proposed the One-Sided Selection (OSS) method, where, after applying CNN, the majority samples that form part of Tomek-Links are removed [75]. Tomek-Links are pairs of samples with distinct classes that are nearest neighbours to each other and, as such, normally correspond to instances located at the border between classes [76]. Another prominent approach, which shares a similar concept with the OSS technique in that all minority class samples are retained in the final balanced dataset, is the Neighbourhood Cleaning Rule (NCR) [77]. The NCR technique leverages the Edited-Nearest Neighbours (ENN) method to identify outliers within the majority class. ENN is an undersampling technique that removes samples whose labels differ from the majority of their k-nearest neighbours [77]. Returning to NCR, a subsequent step involves employing a 3-Nearest Neighbour (3-NN) classifier on the minority class instances. If misclassifications occur, the nearest neighbours from the majority class are eliminated. It is important to note that this description has been simplified to the context of binary classification, whereas the original algorithm is designed to handle multiclass problems, as thoroughly described in [77].
In 2003, Zhang et al. suggested three different undersampling methods based on neighbourhood analysis, namely NearMiss-1, NearMiss-2 and NearMiss-3. NearMiss-1 removes the majority samples whose average distance to their \(k\) nearest minority class neighbours is the smallest (the original article used \(k=3\)). In contrast to NearMiss-1, NearMiss-2 removes the majority samples whose average distance to the \(k\) farthest minority class samples is the smallest. Finally, NearMiss-3 involves two consecutive steps: first, the \(M\) nearest majority neighbours of each minority class sample are determined; then, within this set, the majority samples whose average distance to their three nearest minority neighbours is the smallest are eliminated [78].
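All of the classical undersampling techniques just described have open-source implementations, for instance in the imbalanced-learn package; the snippet below shows typical usage on a toy dataset (parameter values are illustrative only).

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import (
    RandomUnderSampler, CondensedNearestNeighbour, OneSidedSelection,
    NeighbourhoodCleaningRule, EditedNearestNeighbours, TomekLinks, NearMiss,
)

# toy imbalanced dataset: roughly 90% majority vs 10% minority
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

samplers = {
    "RUS": RandomUnderSampler(random_state=0),
    "CNN": CondensedNearestNeighbour(random_state=0),
    "OSS": OneSidedSelection(random_state=0),
    "NCR": NeighbourhoodCleaningRule(),
    "ENN": EditedNearestNeighbours(),
    "Tomek links": TomekLinks(),
    "NearMiss-1": NearMiss(version=1, n_neighbors=3),
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))   # class counts after undersampling
```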
Lastly, we draw attention to the methodology introduced by Smith et al., which is centred around the notion of instance hardness (\(IH\)) [79]. IH represents the probability of a sample being misclassified and serves as a metric for assessing the classification complexity of a particular instance, as introduced in “Classification complexity in imbalanced domains”. Formally, the concept of instance hardness can be derived from Bayes' theorem and, for a given learning algorithm \(h\), assumes the formulation in Eq. (12), \(I{H}_{h}\left({x}_{i},{y}_{i}\right)=1-p\left({y}_{i}|{x}_{i},h\right)\),
where \({x}_{i}\) and \({y}_{i}\) represent the data and corresponding labels and \(h\) refers to the learning algorithm, thus adhering to the notation outlined in “Class imbalance: formal definition”. As one can observe, the computation of instance hardness depends on the chosen classifier; however, the authors suggest adopting a weighted average of the instance hardness values obtained with several learning models, \(\mathcal{L}\), as exposed in Eq. (13), in order to attain a general assessment of instance hardness [79].
From a practical standpoint, the estimation of the conditional probability \(p\left({y}_{i}|{x}_{i},h\right)\) relies on the scores derived from the classifier \({h}_{j}\). As for the selected classifiers, a clustering technique based on Classifier Output Difference was employed [80], which yields a diverse set of learning algorithms and thus a representative value of \(\text{IH}\). In order to effectively apply the quantification of instance hardness for informative undersampling, the authors put forth an approach involving the utilization of a threshold. Specifically, samples characterized by an \(IH\) value surpassing a predetermined threshold (0.7 exhibited the best results in the conducted research) are designated for removal [79].
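A minimal sketch of this idea is shown below: instance hardness is approximated as \(1-p\left({y}_{i}|{x}_{i},h\right)\) averaged (unweighted, for simplicity) over a few diverse scikit-learn classifiers, and samples above the 0.7 threshold are removed; the classifier set is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# out-of-fold estimates of p(y_i | x_i, h) for a small, diverse set of learners
learners = [LogisticRegression(max_iter=1000),
            RandomForestClassifier(random_state=0),
            GaussianNB()]
true_class_probas = []
for h in learners:
    proba = cross_val_predict(h, X, y, cv=3, method="predict_proba")
    true_class_probas.append(proba[np.arange(len(y)), y])   # probability of the true label

instance_hardness = 1.0 - np.mean(true_class_probas, axis=0)  # unweighted average over learners

# remove the samples whose hardness exceeds the threshold reported above
keep = instance_hardness <= 0.7
X_res, y_res = X[keep], y[keep]
print(f"{len(y)} -> {len(y_res)} samples after IH-based undersampling")
```

In the imbalanced setting one may prefer to restrict removal to the majority class so as not to discard scarce minority samples.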
To complement the discussion of undersampling methods, it is important to highlight the study conducted by Wainer et al. [26], which presented a comparative analysis of all the aforementioned methods across 58 imbalanced datasets from the KEEL repository. Experimentally, a threefold repetition of a stratified 20% holdout procedure was used, with a nested threefold cross-validation to select the optimal hyperparameters for each undersampling method, where the classifier was also treated as a hyperparameter. After determining the performance of each approach using 8 distinct performance metrics, a statistical analysis was carried out to determine if there were significant differences between the methods. More specifically, pairwise Wilcoxon tests were applied to the ranks of each method, with a multiple comparison adjustment of the p-value. The results showed that, for the F1-score, NCR had the lowest mean rank, although it did not exhibit significant differences compared to Near-Miss, CNN, and IH.
After reviewing some of the most relevant approaches within undersampling techniques to address class imbalance, an overview of the literature regarding the most recently reported approaches is presented next, as shown in Table 4. Furthermore, in order to enhance the clarity and systematic analysis of the undersampling approaches, we adopt a taxonomy framework proposed in [1]. In this taxonomy, resampling approaches are initially categorized into three main groups: stratified approaches, synthetic data generation methods, and hybrid approaches. Among the stratified approaches, which primarily involve undersampling techniques, further subdivision is performed, resulting in six distinct categories: distance-based, recognition-based, evolutionary-based, cluster-based methods, data cleaning based and random under/over sampling. A schematic representation of the adopted taxonomy is presented in Fig. 3.
Fig. 3
Adopted taxonomy for categorizing undersampling approaches based on the work of Branco et al. [1]
Distance-based methods leverage inter-class or intra-class distances to select pertinent samples. Recognition-based approaches operate by utilizing information from a single class to perform informed resampling, deviating from the traditional discrimination-based inductive learning paradigm [1]. These approaches excel in extreme imbalance scenarios where the available data is insufficient to adequately approximate the underlying data distribution. Examples of recognition-based approaches include Autoencoders and one-class SVMs. Cluster-based resampling methods employ clustering algorithms to model either one or both classes, whereas evolutionary methodologies utilize evolutionary algorithms to optimize a predefined fitness function, ultimately generating a resampled dataset. Cleaning-based methods aim to remove outliers and address instances with overlapping characteristics, as these factors significantly impact the performance of learning models [5]. Lastly, random-based approaches randomly select samples for removal [1]. It is worth noting that certain proposed methods encompass multiple steps, such as combining data cleaning and cluster-based undersampling, resulting in approaches that can simultaneously belong to multiple categories.
Discussion
In line with oversampling approaches, undersampling methodologies also demonstrate a noticeable duality, given that these operate based on two distinct principles. On the one hand, there are methods that perform undersampling in regions of high majority class density, under the assumption that these samples have minimal impact on the model learning process, i.e., the loss of information is marginal. Clustering methodologies are a clear example of such approach. On the other hand, other approaches primarily target the class border to effectively mitigate overlap and the presence of outliers. Data cleaning undersampling methods serve as a classic illustration of this category, where the idea is to maximize the visibility of the minority class. Due to the contrasting paradigms underlying these approaches, their experimental results also differ noticeably. Resampling methods operating near the decision boundary often result in a higher removal of majority samples located at the boundary region, leading to a shift of the decision boundary towards the interior of the majority class. This shift tends to increase true positives (higher sensitivity) but also introduces more false positives (lower specificity). In contrast, resampling methods targeting the central regions of the classes, often referred to as safe regions, tend to strike a better balance between sensitivity and specificity, resulting in higher F1-score and G-mean values but lower sensitivity. Hence, it becomes evident that resampling methods focusing on the decision boundary prove particularly pertinent in domains where the cost associated with false negatives is significantly high, as is the case of the medical field.
Regarding the strategies employed by the undersampling methods to evaluate classification complexity and identify the most relevant samples based on predefined criteria, we observe, once again, a similar pattern to oversampling methods. The predominant strategy appears to be distance-based neighbourhood analysis, which serves as a common method to approximate the underlying data distribution. However, certain methods have explored alternative techniques such as Gaussian Mixture Models (GMMs) [85] or even reinforcement learning [95].
Furthermore, it is important to note that the evaluation and comparison of these new methods often involve benchmarking against non-state-of-the-art approaches such as Random Undersampling, Random Oversampling, or NearMiss. Once again, this practice renders assessing the competitiveness of the proposed methods very challenging. Additionally, we believe that benchmarking undersampling methods should jointly consider both performance metrics and runtime of the algorithms. This is particularly relevant as undersampling methods offer the advantage of significantly reducing the number of samples as opposed to oversampling methods, especially in cases of high imbalance ratios, resulting in faster model training and inference. Hence, it is crucial to consider whether the performance gains achieved by oversampling methods justify the associated computational costs.
Hybrid approaches
As mentioned earlier, the utilization of oversampling methods in data resampling introduces the potential risk of amplifying noise and inducing class overlap, thereby increasing the likelihood of overfitting. Conversely, undersampling methods face the challenge of removing informative samples necessary for accurate model training, resulting in underfitting [95, 96]. To overcome the limitations associated with both approaches, the emergence of hybrid resampling methods has provided a solution by simultaneously employing oversampling and undersampling methods, either in an alternate or sequential manner [45, 96]. Another motivation behind the development of hybrid approaches stems from previously conducted analysis in which no single method seems to guarantee the most competitive performance in all classification contexts. Given the observation that oversampling methods do not consistently outperform undersampling methods and that both can be advantageous under specific circumstances, their integration represents a logical alternative for the creation of more robust and consistent resampling methodologies [95]. Typically, the resampling process commences with the implementation of oversampling techniques, followed by the introduction of undersampling measures aimed at mitigating the occurrence of class overlap through the removal of majority samples, ultimately culminating in the construction of a balanced dataset. In essence, the role of undersampling is analogous to the data cleaning procedures employed post-oversampling with SMOTE, with the distinction that data cleaning operates on the generated minority samples instead of eliminating majority samples as observed in undersampling.
One of the initial hybrid methods that emerged in the field of data resampling was the combination of SMOTE with Tomek links [45] and SMOTE with ENN [58]. The first method applies SMOTE to generate synthetic samples, then determines the Tomek links present in the balanced dataset, and finally eliminates the majority samples that form part of those Tomek links [45]. On the other hand, the SMOTE + ENN approach begins with the application of SMOTE to augment the minority class, and subsequently employs the ENN algorithm to address misclassified samples, irrespective of their class (majority or minority). As such, the data cleaning process conducted by SMOTE + ENN is more comprehensive [58]. Over the years, a wide range of hybrid resampling methods have been proposed. In line with the preceding sections, we aim to analyse the recent trends in the literature pertaining to this topic, as described in Table 4.
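Both of these hybrid baselines are readily available, for example in imbalanced-learn; a typical invocation is sketched below (the dataset and random seeds are illustrative).

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek, SMOTEENN

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# SMOTE followed by removal of the majority samples participating in Tomek links
X_tl, y_tl = SMOTETomek(random_state=0).fit_resample(X, y)

# SMOTE followed by ENN cleaning of misclassified samples from both classes
X_enn, y_enn = SMOTEENN(random_state=0).fit_resample(X, y)

print("SMOTE + Tomek links:", Counter(y_tl))
print("SMOTE + ENN:", Counter(y_enn))
```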
Discussion
Initially, one can verify that approaches where oversampling is applied followed by undersampling and vice-versa are equally common. Within the first defined architecture, the majority of methods involve the application of SMOTE followed by an undersampling method that aims to mitigate the overfitting tendency associated with SMOTE. Typically, these undersampling methods leverage neighbourhood analysis to evaluate regions of high classification complexity or employ genetic algorithms for sample selection. As for the second framework, it is important to note that oversampling is done through Random oversampling or SMOTE where sample generation is adaptive, that is, inversely proportional to the instance hardness of the minority samples. Thus, and assuming that this is the last step of the hybrid method, it is advisable to incorporate mechanisms that mitigate SMOTE´s inclination to oversample noisy or overlapping instances. Moreover, various alternative resampling architectures have been proposed. In a study by [106], the authors introduce a methodology that involves the iterative execution of undersampling and oversampling steps, accompanied by continuous validation of the generated or eliminated instances using the error rate of a Decision tree classifier. Similarly, the \(SS{O}_{maj}-SMOTE-SS{O}_{min}\) method proposes the interleaved application of two particle swarm optimization undersampling processes with SMOTE oversampling.
To finalize this discussion, it is important to stress that one of the primary concerns associated with hybrid resampling methods is the inherent computational cost of employing two or more resampling mechanisms. To tackle this challenge, simpler resampling methods such as SMOTE or ENN are often preferred, as they reduce the overall computational complexity of the method. It should be noted that even the three-step approach mentioned earlier, \(SS{O}_{maj}-SMOTE-SS{O}_{min}\), conducts undersampling as a preliminary stage, resulting in a substantial reduction in the number of samples generated by oversampling. This design choice ensures that the method's runtime is comparable to that of ADASYN and even SMOTE [106].
Resampling in deep learning applications
Deep learning is a subfield of machine learning that relies on models built with multi-layered architectures, enabling the learning of hierarchical data representations [111]. This mimics, to some extent, the structure and functionality of the human brain [111, 112]. These layers are typically tailored to the nature of the data being analysed: for instance, 2D convolutional layers are often used for analysing spatial patterns in images, while transformer or Long Short-Term Memory (LSTM) blocks are suited for processing sequential data [111]. One key differentiator of DL models from traditional ML approaches is their ability to bypass the need for many handcrafted preprocessing steps such as feature selection. Instead, DL models can learn representations directly from raw data, providing an unprecedented level of flexibility and ability to capture complex patterns in the data [111]. However, this capability comes at the cost of increased model complexity, demanding substantial computational resources. That being said, the evolution of specialized hardware and software for highly efficient and parallelized model training, alongside the increased availability of large-scale datasets, has alleviated these limitations, enabling the development of DL systems that consistently outperform conventional (shallow) models [111, 112]. For a comprehensive examination of state-of-the-art deep learning frameworks, we direct the reader to [111].
Nevertheless, DL models, much like traditional ML methods, struggle with challenges associated with imbalanced data distributions. This issue is compounded by factors such as data drift, high number of classes, and the relationship between imbalanced data and the extracted embeddings [113]. To address these limitations, numerous approaches have been proposed, spanning both data-driven and algorithmic-driven solutions. Within the scope of this article, we concentrate on interpolation-based data-driven methods, such as those elaborated upon in "Oversampling", "Undersampling approaches" and "Hybrid approaches".
In [114], the authors proposed a network detection system (NDS) leveraging a hybrid sampling technique combined with a one-dimensional convolutional neural network (1D-CNN). The study utilized two open-source datasets: UNSW-NB15, comprising 2.54 million samples with 87.35% representing normal traffic and 12.65% attack traffic, and CICIDS2017, consisting of approximately 2.8 million samples, of which 80.3% correspond to normal traffic and the remaining data span 14 distinct attack categories [114]. Following preprocessing steps such as one-hot encoding, feature reduction, data standardization, and feature selection, resampling was applied using a specialized method. In this approach, the target sample size per class \({I}_{resample}\) was defined as the total number of samples divided by the number of classes. For classes with fewer samples than \({I}_{resample}\), SMOTE was employed, while classes exceeding this threshold underwent Gaussian Mixture Model (GMM)-based clustering for sample extraction [114]. Experimentally, the proposed method consistently achieved higher F1-scores in both binary and multiclass classification tasks compared to approaches combining Random Forest (RF) and Multi-Layer Perceptron (MLP) with traditional resampling techniques such as ROS, SMOTE, ADASYN, RUS + SMOTE, and K-means + SMOTE. Additionally, the approach outperformed other state-of-the-art methodologies [114]. In the context of cybersecurity, Abdelkhalek et al. aimed to develop a network detection system (NDS) based on the open-source dataset NSL-KDD, which includes normal traffic data and four types of attacks: Denial of Service (DoS), Probing, Remote-to-Local (R2L), and User-to-Root (U2R) attacks [115]. To address the class imbalance present in the dataset, the authors employed a combination of ADASYN and Tomek Links. This resampling approach was evaluated using various models: Convolutional Neural Network (CNN), Multi-Layer Perceptron (MLP), Deep Neural Network (DNN), and a hybrid CNN-BLSTM architecture. Experimentally, it was observed that in binary classification scenarios, the CNN model combined with ADASYN and Tomek Links achieved the most competitive F1-score (84%) [115]. Conversely, for multiclass classification, the MLP model coupled solely with oversampling yielded the best performance (87%). Additionally, independent of the model used, the application of oversampling consistently resulted in significant performance improvements [115]. In [116], the authors also developed a method addressing class imbalance through the combination of random undersampling of the majority class (normal traffic) and SMOTE applied to two minority classes (Web Attacks and Infiltration Attacks). Classification was performed using a model comprising two LSTM layers and three fully connected layers [116]. The dataset employed in this study was CSE-CIC-IDS2018, which contains eight classes, seven of which correspond to distinct network attack types. Experimental results demonstrated that the adopted architecture outperformed traditional machine learning algorithms such as Decision Trees, Random Forests, k-Nearest Neighbours, and Support Vector Machines [116]. In the context of fraudulent transaction detection, the authors in [117] utilized SMOTE-ENN to generate synthetic samples due to the severe class imbalance in the European credit card dataset, where only 0.172% of the samples represented fraudulent transactions. 
Following this resampling step, an ensemble model was trained, consisting of a GRU network and an LSTM network, whose predictions were combined via a MLP acting as a meta-classifier [117]. Experimental results demonstrated that incorporating resampling into this approach enhanced sensitivity, specificity, and AUC by 10.5%, 3.2%, and 8.7%, respectively. Furthermore, the proposed method was tested on another dataset (Taiwan dataset), where similar performance improvements were also observed, along with superiority over other machine learning algorithms such as AdaBoost, Random Forests (RF), MLPs, LSTMs, and GRU networks [117]. In the domain of emotion detection using EEG signals, motivated by the rising prevalence of health issues related to stress and anxiety, the authors in [118] developed a methodology to evaluate degrees of arousal and valence, categorized as positive, medium, or negative, based on physiological signals. Data was sourced from the DEAP database, which comprises EEG recordings from 32 participants collected through 32 electrodes. After preprocessing, spectral power features from five frequency bands were extracted using the Short-Time Fourier Transform applied to 2-s sliding windows [118]. Subsequently, Borderline SMOTE was applied for oversampling, and the resulting augmented data were used to train a 1D-CNN. Results showed an increase in F1-score from 92.04% to 97.54% for valence classification and from 92.75% to 97.84% for arousal classification, following the introduction of resampling [118].
In the field of medical applications, the authors in [119] assessed three network architectures, CNN, CNN-LSTM, and an ensemble combining predictions from both a CNN and a CNN-LSTM, with three resampling techniques (RO, SMOTE, and SMOTE + Tomek Links) to classify cardiac arrhythmias. Two open-source datasets, MIT-BIH and PTBDB, were employed, collectively providing around 110,000 heartbeat samples categorized into six distinct classes, with the majority representing normal heartbeats and the remainder denoting various arrhythmia types [119]. The preprocessing stage involved only the computation of the signal derivative, which served as the input to the models. Among the architectures tested, the ensemble model exhibited the best performance, with SMOTE and SMOTE + Tomek Links enhancing the F1-score by 0.31% and 1.63%, respectively, while RO led to a decline in performance [119]. In another study, Sadeghi et al. developed a DNN with a structure consisting of one input layer, three hidden layers, and one output layer to predict diabetes onset [120]. The dataset used was the Tehran Lipid and Glucose dataset, containing demographic, anthropometric, and lifestyle features, among others. The DNN was evaluated with five resampling methods: Repeated Edited Nearest Neighbours (RENN), OSS, SMOTE, SVM-SMOTE, and SMOTE-ENN [120]. Experimental findings demonstrated that RENN and OSS achieved the highest F1-scores, while SMOTE resulted in the lowest performance among the tested approaches [120]. In [121], the authors developed a system for identifying human activities in smart homes based on multimodal data, a type of application particularly relevant for assisting elderly individuals or people with cognitive impairments. Specifically, they utilized the SPHERE dataset, which includes data from accelerometers worn by participants, information gathered from Passive Infrared (PIR) sensors, and 2D and 3D data captured by three cameras [121]. The dataset labels corresponded to 20 distinct actions such as climbing stairs, sitting, jumping, and others. After preprocessing and feature extraction, three models (CNN, LSTM, and CNN-LSTM) were evaluated, along with their combinations with resampling techniques, including ROS, SMOTE, and RUS. The results indicated that oversampling methods significantly enhanced accuracy, most notably from 68.29% to 93.67% when SMOTE was combined with the CNN-LSTM model. In contrast, the use of undersampling methods resulted in performance reductions exceeding 50% compared to the baseline with no resampling [121].
The study in [122] introduced a novel framework, DeepSMOTE, which combines an encoder-decoder architecture with the SMOTE algorithm to address class imbalance in image datasets. Raw images are mapped into a low-dimensional latent space via the encoder, where SMOTE is applied to perform oversampling. The generated synthetic samples are subsequently decoded back into the original image space. This approach enhances SMOTE’s ability to process raw images while maintaining high visual quality in synthetic data. Additionally, DeepSMOTE overcomes common limitations of GANs, such as their requirement for extensive data, difficulty in hyperparameter tuning, and vulnerability to mode collapse. The training process for DeepSMOTE begins by encoding and decoding data from all classes to compute a reconstruction loss [122]. To introduce diversity in the latent space, an additional penalty term is incorporated, prompting the network to decode samples that differ from the input images, thereby emulating the SMOTE mechanism in the latent space. This penalty is realized by encoding, shuffling, and decoding samples within the same class, with the reconstruction loss being calculated between the original images and their decoded counterparts derived from different (shuffled) input images. Five publicly available image datasets—MNIST, CIFAR-10, Fashion-MNIST, SVHN, and CelebA—were used for evaluation, alongside six resampling techniques: four pixel-based methods (SMOTE, AMDO, MC-CCR, and MC-RBO) and two GAN-based techniques (BAGAN and GAMO). The ResNet-18 model was chosen as the backbone for all experiments. Analysis of synthetic sample placement showed that DeepSMOTE prioritized generating samples at class boundaries, thereby improving inter-class discrimination. Conversely, GAN-based methods like BAGAN and GAMO concentrated synthetic samples near class centres. In terms of performance, DeepSMOTE significantly outperformed all other methods, particularly when compared to pixel-based oversampling. The study also highlighted DeepSMOTE’s resistance to mode collapse, its robustness to extreme class imbalance, its ability to generate visually realistic synthetic images, and its consistent performance across varied conditions [122]. In a similar approach, Abayomi-alli et al. [113] proposed a method for transforming data into a low-dimensional space, where oversampling is performed using SMOTE-Cov [123]. Subsequently, the data are reverted to their original space for model training. In this case, the dimensionality reduction process is guided by the Kullback–Leibler divergence, aiming to achieve minimal divergence between the original and lower-dimensional representations. Subsequently, reverting the data to the original space is achieved using local regression techniques, such as cubic polynomials, which approximate the mapping based on neighbouring points within the manifold [113]. The proposed algorithm was applied to differentiate between melanoma and non-melanoma skin lesions, yielding an increase in the F1-score from 62.5% to 65.4% on the PH2 dataset via resampling, leveraging the SqueezeNet model [113].
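The core encode/SMOTE/decode loop of DeepSMOTE can be sketched in a few lines; the tiny untrained encoder and decoder below are placeholders (in DeepSMOTE they are trained beforehand with a reconstruction loss plus the class-wise permutation penalty described above), and all dimensions are illustrative.

```python
import numpy as np
import torch
import torch.nn as nn
from imblearn.over_sampling import SMOTE

latent_dim, img_dim = 8, 28 * 28
# placeholder encoder/decoder; DeepSMOTE trains these jointly before resampling
encoder = nn.Sequential(nn.Flatten(), nn.Linear(img_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, img_dim))

# toy imbalanced image data: 200 majority images vs 20 minority images
X = torch.rand(220, 1, 28, 28)
y = np.array([0] * 200 + [1] * 20)

with torch.no_grad():
    Z = encoder(X).numpy()                                  # 1. map images to the latent space

Z_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(Z, y)  # 2. SMOTE on latent codes

# 3. decode only the synthetic latent points (imblearn appends them after the originals)
Z_new = torch.tensor(Z_res[len(Z):], dtype=torch.float32)
with torch.no_grad():
    synthetic_images = decoder(Z_new).reshape(-1, 1, 28, 28)

print(synthetic_images.shape)   # synthetic minority images in the original image space
```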
Overall, it is apparent that the majority of Deep Learning applications apply resampling directly to raw data, often disregarding complexities such as temporal dependencies in sequential data or spatial patterns in images. Despite this, the examined studies commonly report performance gains, validating the effectiveness of resampling techniques. Nonetheless, it is reasonable to infer that conducting oversampling within a lower-dimensional latent space, such as the one used in DeepSMOTE, can better preserve data properties, potentially achieving even more pronounced performance improvements. Furthermore, the observed preference for oversampling and hybrid sampling methods, rather than undersampling, is logical given the strong dependency of Deep Learning models on large datasets for effective training. Finally, it is noteworthy that the resampling methods employed are often relatively simple and do not incorporate complexity assessments to adopt more sophisticated resampling protocols, which indicates significant potential for optimization in this domain.
After evaluating the three principal categories of resampling methods applied in traditional machine learning and deep learning systems, their main domains of application can be identified. It is important to highlight that most of these techniques are intentionally designed to be application-agnostic, enabling them to handle any datasets with numerical attributes that can either be converted into one-dimensional arrays or are already in the required format. The sole exceptions are DeepSMOTE [122] and the methodology proposed in [113], both of which are explicitly developed for image-related tasks. The reviewed literature reveals the widespread adoption of resampling techniques across several fields. In healthcare applications, examples include Outlier SMOTE for COVID-19 diagnosis [34], Homogeneous Consensus Clustering-Based Undersampling for protein homology prediction [22], emotion detection [118], and cardiac pathology classification [119]. In software engineering, resampling has been employed for bug detection (e.g., SMOTEFUNA [23], CR-SMOTE [63]) and network detection systems [114‐116]. In finance, methods such as Gaussian Mixture Undersampling and SMOTE-ENN integrated with ensemble models leveraging LSTMs and GRUs have been used for credit card fraud detection [117]. These findings are further corroborated by a recent systematic review conducted by Vargas et al., which analysed reported resampling techniques across seven databases [8]. Utilizing the keywords “imbalanced data,” “machine learning,” and “preprocessing,” the researchers reduced a selection of nearly 10,000 articles to 35 after a thorough screening process. Among the selected studies, 34.3% focused on healthcare, 22.9% on finance, 14.3% on engineering, 8.6% on biology, 8.6% on software, and 11.4% on other domains. This distribution reflects and reinforces earlier observations, confirming the cross-disciplinary relevance of resampling methodologies [8].
Resampling approaches and complexity assessment
The analysis of Tables 3, 4 and 5 highlights that the various modifications proposed to resampling methods often require an evaluation of classification complexity, where different metrics are computed in order to provide insights regarding the presence of data difficulty factors, which add difficulty to the classification task. Consequently, these can be considered complexity measures by definition, and, even though in most cases they do not strictly adhere to the formulations in “Complexity metrics in imbalanced domains: a survey”, they can still be categorized under the taxonomy proposed by Santos [5], based on the insights they provide. To illustrate, in [64], the authors employ the L2 metric to quantify the extent of overlap in the data and subsequently utilize the Tomek-link algorithm for data cleaning. In [35], the N3 metric is employed to categorize samples into Normal Region, Semi-Normal Region, Semi-critical Region, and Critical Region, while [52] employs N3 to classify samples as safe, borderline, or outliers. In [41], the concept of hyperspheres, introduced in T1, is utilized to determine the Classification Contribution Degree of each sample, guiding the adaptive oversampling process. In [86], a metric very similar to the Bayes Imbalance Impact Index is utilized to determine the sampling probability of the majority class to avoid eliminating relevant samples during undersampling. In Borderline SMOTE and ADASYN, nearest neighbours analysis is employed to calculate the oversampling probability for minority samples, whereas in [52, 61] and [44], it is used to filter out noisy samples before oversampling, thus corresponding to the instance-level overlap metrics referenced in [5]. Overall, instance-level overlap metrics seem to be the most commonly applied complexity metrics in this regard, given that they enable a sample-wise analysis of classification complexity.
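As a simple illustration of this kind of instance-level analysis, the sketch below labels each sample according to the class composition of its \(k\) nearest neighbours; the thresholds separating safe, borderline and outlier samples are arbitrary choices for illustration rather than values taken from the cited works.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

k = 5
_, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
# fraction of the k nearest neighbours (excluding the sample itself) from the opposite class
disagreement = (y[idx[:, 1:]] != y[:, None]).mean(axis=1)

# instance-level categorisation in the spirit of the safe/borderline/outlier split above
region = np.full(len(y), "safe", dtype=object)
region[disagreement >= 0.5] = "borderline"
region[disagreement >= 0.9] = "outlier"

minority = 1
counts = {r: int(np.sum(region[y == minority] == r)) for r in ("safe", "borderline", "outlier")}
print("minority-class regions:", counts)   # e.g. used to decide where (not) to oversample
```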
Table 3
Comprehensive summary of recently reported works in the literature focusing on SMOTE-based oversampling methods
Similar method to ADASYN. Initially, a Euclidean distance matrix is computed, which captures the pairwise distances among all the points within the minority class. Secondly, the sum of all distances relative to each sample is computed (\(ES\)). Finally, the values obtained are normalised based on the sum of all the \(ES\) values. The resulting normalized values serve as a proportionate measure for determining the number of samples to be generated. As a result, this method places greater emphasis on generating synthetic samples in regions characterized by greater isolation
To assess the effectiveness of outlier-SMOTE, testing was conducted on 5 benchmark datasets sourced from the UCI repository. Additionally, a specialized dataset related to COVID-19 diagnosis was included in the evaluation. The findings consistently revealed the superior performance of outlier-SMOTE when compared to ADASYN [30] and the original SMOTE [23], yielding improved Recall, Precision and F1-Score across the majority of the tested datasets
RSMOTE. Taxonomy: (1) Initial selection of samples to be oversampled
RSMOTE introduces a domain partitioning mechanism that divides the dataset into four distinct regions: Normal Region, Semi-Normal Region, Semi-critical Region, and Critical Region. The classification of these regions is determined based on the abundance of minority samples within the neighbourhood of each instance. The Normal Region predominantly encompasses neighbourhoods characterized by a high concentration of minority class instances, while the Critical Region represents the converse situation. Importantly, RSMOTE considers wide neighbourhoods, encompassing half of the minority class distribution, enabling a comprehensive analysis of the overall distribution patterns. Leveraging this domain characterization, RSMOTE can exclusively perform oversampling (via SMOTE) within the Normal Region, or exclusively in the Semi-Normal Region, etc
The examination of the proposed method on four UCI medical datasets showed that performing oversampling in the Normal and Semi-Normal regions yielded the most competitive performances. In a consistent manner, this method outperformed Borderline-SMOTE [28], Local Neighbourhood SMOTE [36] and SMOTE-IPF [37]
In this approach, Laplacian eigenmaps are utilized to construct a feature space that optimizes class separability and alleviates the risk of overlapping samples generated by SMOTE. Notably, Laplacian eigenmaps effectively capture the local structures inherent in the data, enabling the application of oversampling techniques. To put this into practice, the training data, in conjunction with unlabelled testing data, is employed to compute the transformed data representation using Laplacian eigenmaps. Subsequently, a resampling approach, such as SMOTE or Borderline SMOTE, is applied on the data residing in the newly obtained feature space. Note that, as proposed by the authors, the classification is conducted on the new feature space, meaning that the generated samples do not need to be transformed to the original space
Analysis of experimental results on 22 binary datasets from the KEEL repository showed that dimensionality shifting based on Laplacian eigenmaps guaranteed superior performance (F1-score and G-mean), when compared to applying PCA or Local Linear Embedding (LLE), regardless of the resampling methodology applied on the new feature space (SMOTE [23], Borderline-SMOTE [28], Safe-Level SMOTE [31], Cluster-SMOTE [39] and ADASYN [30] were tested)
The proposed method aims to generate samples that exhibit increased diversity and conform more closely to the true underlying distribution of the minority class. Initially, a sample is randomly selected from the minority class, \({S}_{1j}\), followed by the determination of the furthest minority sample, \({S}_{2j}\), with this distance assessed through the Manhattan distance. To generate new samples, the following formulation is applied: \({S}_{newj}=Random(\text{min}\left({S}_{1j},{S}_{2j}\right),\text{max}\left({S}_{1j},{S}_{2j}\right))\). This process forces the new sample to be located within the cuboidal potential space formed between the two minority samples. Finally, the viability of the generated sample is verified by examining its nearest neighbour: if the nearest neighbour belongs to the majority class, the generated sample is considered unsuitable and is discarded
The algorithm underwent testing on 33 open-source software datasets, enabling the assessment of its performance in comparison with SMOTE [23], ADASYN [30], and SWIM [40]. Additionally, a Wilcoxon signed-rank test was conducted to determine the statistical significance of the performance disparities between SMOTEFUNA and the other algorithms. In the cases where the proposed method yielded inferior results, the authors attributed this fact to the lack of an outlier removal procedure before oversampling
OS-CCD. Taxonomy: (1) Selection of initial samples to be oversampled; (2) Adaptive generation of synthetic examples
The proposed methodology encompasses a dual-level analysis, leveraging both the macro distribution and the local neighbourhood of the minority class, to facilitate informed oversampling. For the macro distribution assessment, \(k\)-means clustering is applied to obtain the clusters referring to the minority class, followed by the computation of the cluster ratio, \({R}_{k}\), which corresponds to the number of minority samples within each cluster. Concurrently, the local analysis involves the calculation of two essential distances for each minority instance: the distance to its nearest neighbour of the same class, \({d}_{i}^{P}\), and the distance to the nearest majority class instance, \({d}_{i}^{N}\). Based on \({d}_{i}^{P}\) and \({d}_{i}^{N}\), the Type-P and Type-N safe neighbourhoods are defined, respectively, as hyperspheres with radiuses given by the previously calculated distances. Finally, the safety degree, \({S}_{i}\), is estimated, corresponding to the squared difference between the number of minority class samples, \({a}_{i}\), and majority class samples, \({b}_{i}\), contained in the union of the Type-P and Type-N safe neighbourhoods, as explained in the following formulation: \({S}_{i}={({a}_{i}-{b}_{i})}^{2}\). In order to combine both domain evaluations, the Classification Contribution Degree (CCD) is defined as \({F}_{i}=\frac{{R}_{i}}{{S}_{i}+{A}{\prime}}\), where \({A}{\prime}\) is a correction coefficient that prevents the denominator from being \(0\). Finally, oversampling is conducted proportionally to the CCD value based on the SMOTE methodology
In order to evaluate the efficacy of the proposed OS-CCD approach, a rigorous performance analysis was conducted, comparing it with other state-of-the-art techniques, including ROS [24], SMOTE [23], k-means SMOTE [3], NRAS [42] and Gaussian-SMOTE [43]. The performance of these methods was assessed using 12 datasets. The experimental results demonstrated that OS-CCD outperformed the other techniques in terms of average AUC values on 10 out of the 12 datasets evaluated. This superiority was not only limited to higher average AUC values but also reflected in the reduced standard deviation, indicating the method's robustness and stability across the datasets. When OS-CCD did not guarantee the highest performance, the authors realized that data had a significant number of outliers
Radius-SMOTE. Taxonomy: (1) Selection of initial samples to be oversampled; (2) Type of interpolation
The proposed approach consists of a framework comprising three essential phases: initial minority sample filtering, safe radius distance calculation, and generation of new samples. In a first instance, a \(k\)-NN classifier with \(k=5\) is used to categorize the misclassified samples as noise or, on the other hand, if correctly classified, samples are considered safe. Noisy samples are discarded and will not be utilized to generate synthetic data. Afterwards, the safe radius, \({r}_{ij}\), is defined as the distance to the nearest majority class sample. As such, new synthetic samples, \({a}_{ij}\), are generated within the previously defined hypersphere through interpolation in two directions, as described in the following formulation: \({a}_{ij}={P}_{j}+\left(rand\left(\text{0,1}\right)\times \left({p}_{j}\pm {r}_{ij}\right)\right)\) where \({P}_{j}\) refers to the minority sample under study
The comparison of Radius-SMOTE with other resampling methodologies, such as SMOTE [23], Borderline-SMOTE [28], ADASYN [30], SMOTE-IPF [37] and SMOTE-TL [45] in 13 datasets from the KEEL repository, revealed that it guaranteed the highest performance in 5 of the 13 datasets. Complementarily, tests were conducted with the addition of class and attribute noise, wherein Radius-SMOTE was found to be robust to both class and attribute noise
Methodology directed to address the limitations of SMOTE in high-dimensional spaces containing redundant or noisy attributes. This variant of SMOTE suggests a novel means of computing the nearest neighbours for each sample of the minority class, since a distinct importance is imposed on each variable based on feature importance. To this end, the authors proposed using the Induced Minkowski OWA distance (IMOWAD) operator to compute the distance between data points, which corresponds to a weighted average of the Minkowski distance between two points, \({x}_{i}\) and \({x}_{{i}{\prime}}\), whose formulation is the following: \(IMOWAD(\left({u}_{1},{x}_{i,1},{x}_{{i}{\prime},1}\right),\dots ,\left({u}_{1},{x}_{i,n},{x}_{{i}{\prime},n}\right))={\left(\sum_{j=1}^{n}{w}_{j}\times \left|{x}_{i,j}-{x}_{{i}{\prime},j}\right|\right)}^{p}\), where \(u\) is an order inducing variable and \(\sum_{j=1}^{n}{w}_{j}=1\). To obtain the weight vector, \({\varvec{w}}\), different variants of the Regular Increasing Monotone (RIM) quantifier are used [47]. In practice, this methodology initiates by employing a filter method, such as Fisher-score, Mutual Information, Correlation Score, or Eigenvector Centrality, to identify and rank a subset of relevant features. Consequently, during the computation of nearest neighbours, the IMOWAD algorithm is utilized, wherein each feature is assigned a weight proportional to its importance as determined by the feature selection method. The subsequent steps of the procedure align with the standard SMOTE technique
Experimental evaluation of this approach was conducted on 42 benchmark datasets from the UCI and KEEL repositories. The comparison of FW-SMOTE with other variants of SMOTE, namely SMOTE [23], Borderline SMOTE [28], Safe-level SMOTE [31], ADASYN [30], Adaptive Neighbour SMOTE [48], DBSMOTE [33], MWMOTE [32], Relocating Safe-level SMOTE [49], the RBO technique [50], and PCA in combination with standard SMOTE, showed that FW-SMOTE is clearly superior, as evaluated by the average rank computed with the Friedman test with Iman-Davenport correction, the average AUC and the average G-mean
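A hedged sketch of the core idea behind FW-SMOTE, namely feature-importance-weighted neighbour computation, is shown below; mutual information is used as the filter method and the full IMOWAD operator with RIM quantifiers is simplified to a plain importance-weighted Minkowski distance, so this is an approximation of the described formulation rather than a faithful implementation.

```python
# Sketch of feature-weighted neighbour search in the spirit of FW-SMOTE:
# feature importances from a filter method (here mutual information) are turned
# into weights for a weighted Minkowski distance used in the kNN step of SMOTE.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.neighbors import NearestNeighbors

def weighted_minority_neighbours(X, y, minority_label, k=5, p=2):
    # Filter step: rank features and normalize importances into weights summing to 1.
    importances = mutual_info_classif(X, y, random_state=0)
    if importances.sum() > 0:
        w = importances / importances.sum()
    else:
        w = np.full(X.shape[1], 1.0 / X.shape[1])

    X_min = X[y == minority_label]
    # Weighted Minkowski distance: rescaling features by w**(1/p) is equivalent
    # to (sum_j w_j |x_ij - x_i'j|^p)^(1/p) under the standard Minkowski metric.
    X_scaled = X_min * w ** (1.0 / p)
    nn = NearestNeighbors(n_neighbors=k + 1, metric="minkowski", p=p).fit(X_scaled)
    _, idx = nn.kneighbors(X_scaled)
    return idx[:, 1:]  # drop each sample itself; remaining steps follow standard SMOTE
```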
(1) Selection of initial samples to be oversampled
(2) Adaptive generation of synthetic examples
The proposed method is based on the utilization of Gaussian Mixture Models (GMMs) to effectively model the probability density of each instance, thereby enabling the adoption of an appropriate resampling protocol. Initially, two GMMs are constructed: one to capture the probability density of the minority class and another to represent the majority class. Subsequently, the instances whose probability density of belonging to their own class, \({P}_{i}\), is lower than the probability density of belonging to the complementary class, \({P}_{i^{\prime}}\), are considered noise, as reflected in the following equation: \({P}_{i}<{P}_{i^{\prime}}\times \lambda \), where \(\lambda \) controls the degree of aggressiveness in the selection of outliers. In a subsequent phase, the samples are further categorized into safe, boundary, and outliers based on an evaluation similar to that conducted for noise removal. Finally, depending on the classification of each minority instance, a distinct number of nearest neighbours is employed for oversampling. Specifically, 5 nearest neighbours are utilized for safe samples, 3 nearest neighbours for boundary samples, and 1 nearest neighbour for outliers. This approach aims to mitigate the likelihood of inducing class overlap during the oversampling process
The authors evaluated the performance of GSMOTE on 24 UCI binary datasets against other SMOTE variants, namely ROS [24], SMOTE [23], Borderline SMOTE [28], Safe-Level SMOTE [31], SMOTE-IPF [37], GG-SMOTE [51], RNG-SMOTE [51] and NCN-SMOTE [51]. In terms of the G-mean metric, GSMOTE produced the best results on 14, 10, and 13 of the 24 datasets with the SVM, Logistic Regression, and C4.5 decision tree classifiers, respectively. Additionally, it was verified that SMOTE variants that performed noise filtering prior to oversampling and/or utilized adaptive sample generation algorithms significantly improved performance when compared to SMOTE
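The density-based categorization step can be sketched as follows; the two-component mixtures, the \(\lambda\) value, and the margin used to separate safe from boundary samples are illustrative assumptions, since the description above only specifies that the safe/boundary/outlier split follows an evaluation similar to the noise-removal rule.

```python
# Sketch of the GMM-based instance categorization described above. Two Gaussian
# Mixture Models (one per class) estimate densities; minority samples whose
# own-class density falls below lambda times the opposite-class density are
# treated as noise. The "safe" margin below is an illustrative choice.
import numpy as np
from sklearn.mixture import GaussianMixture

def categorize_minority(X, y, minority_label, n_components=2, lam=1.0):
    X_min, X_maj = X[y == minority_label], X[y != minority_label]
    gmm_min = GaussianMixture(n_components=n_components, random_state=0).fit(X_min)
    gmm_maj = GaussianMixture(n_components=n_components, random_state=0).fit(X_maj)

    # score_samples returns log densities; exponentiate to compare densities directly.
    p_own = np.exp(gmm_min.score_samples(X_min))
    p_other = np.exp(gmm_maj.score_samples(X_min))

    noise = p_own < p_other * lam      # discarded before oversampling
    safe = p_own >= 2.0 * p_other      # illustrative margin for "safe" samples
    boundary = ~noise & ~safe          # everything in between
    return noise, boundary, safe
```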
(1) Selection of initial samples to be oversampled
(2) Operations with dimensionality changes
(3) Adaptive generation of synthetic examples
(4) Filtering of noisy generated instances
The proposed method comprises two major components. Firstly, a weighted Manhattan distance metric is employed to identify the \(k\)-nearest neighbours for each sample. The weighting is performed using Information Gain or alternative techniques such as Wojna1 and Wojna2 [53]. Subsequently, the samples are classified into three categories: safe, borderline, or outlier, based on the number of majority samples within the \(k\)-nearest neighbour set of each minority sample. Upon removing noisy samples, the method utilizes the concept of the k-NN hub, which involves analysing the nearest neighbours of the minority samples. This analysis provides valuable information for determining the appropriate level of oversampling (magnification) required and selecting the relevant samples for the oversampling process. A detailed explanation of the method is provided in [52]. Lastly, a procedure for avoiding the generation of irregular data is included
The performance evaluation of AWH-SMOTE was conducted on nine numerical binary-class datasets sourced from the KEEL data repository. To assess the effectiveness of AWH-SMOTE, its performance was compared against four well-established algorithms, namely ROS [24], SMOTE [23], Borderline-SMOTE [28], and ADASYN [30]. The C4.5 classifier was utilized. By weighting attributes for the nearest-neighbour computation and introducing the noise removal procedure, performance increases of up to 9% were registered. Using information gain to perform attribute weighting was the most competitive approach
This method initiates by randomly selecting a sample from the minority class, denoted as \({x}_{center}\). Subsequently, G-SMOTE employs the concept of a hypersphere centred at \({x}_{center}\) to generate new samples uniformly distributed within this hypersphere. To define the radius of the hypersphere, another sample, \({x}_{surface}\), is selected. The selection of \({x}_{surface}\) can be accomplished using three different modes of operation: (1) randomly selecting a sample from the nearest neighbours of the minority class (similar to SMOTE), (2) choosing the closest sample of the minority class, or (3) selecting the closest sample regardless of its class (either minority or majority). After determining \({x}_{surface}\), a sample is randomly generated on the surface of the hypersphere. This sample is then translated from the surface to the interior of the hypersphere by moving along the vector direction formed by \({x}_{center}\) and \({x}_{surface}\). Subsequently, a second sample is generated within the defined hypersphere by applying transformations to the probability distribution function that governs the generation of new samples. These transformations aim to limit the tendency of SMOTE to overfit
Tests were conducted on 28 benchmark datasets, where the performance of G-SMOTE was compared against ADASYN [30] and the original SMOTE [23]. G-SMOTE ensured the highest average AUC, G-mean and F1-score across all datasets
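A minimal sketch of the geometric generation step is given below; the uniform sampling on the hypersphere surface follows the description above, whereas the translation towards the interior is reduced to a random shrink factor, omitting the distribution-shaping transformations of the full method.

```python
# Sketch of the geometric generation step described for G-SMOTE, assuming
# numerical features: draw a point uniformly on the surface of the hypersphere
# centred at x_center with radius ||x_surface - x_center||, then pull it
# towards the interior with a random shrink factor (a simplification).
import numpy as np

def g_smote_sample(x_center, x_surface, rng=None):
    rng = np.random.default_rng(rng)
    radius = np.linalg.norm(x_surface - x_center)

    # Uniform direction on the unit hypersphere: normalized Gaussian vector.
    direction = rng.normal(size=x_center.shape)
    direction /= np.linalg.norm(direction)
    on_surface = x_center + radius * direction

    # Translate from the surface towards the interior of the hypersphere.
    shrink = rng.uniform(0, 1)
    return x_center + shrink * (on_surface - x_center)
```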
Methodology that aims to mitigate the oversampling of outliers that may induce overfitting. To this end, this method starts with the application of the original SMOTE, followed by the implementation of an outlier removal method, namely the Local Outlier Factor (LOF). The LOF score of a sample corresponds to the quotient between the average local density of its \(k\)-nearest neighbours and the local density of the sample under analysis. As such, outliers are characterized by having a lower local density than their nearest neighbours, thus leading to higher LOF values [56]
Experiments in this paper aimed to gauge the impact of outlier removal after SMOTE. Therefore, a comparative analysis was conducted between the original SMOTE [23] and SMOTE-LOF on three open-source databases. Accuracy, precision, recall, F1-score, and AUC were utilized as evaluation metrics. It was found that SMOTE-LOF guarantees a 1–6% increase in F1-score compared to SMOTE, thereby illustrating the relevance of data cleaning after SMOTE
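A compact sketch of the SMOTE-LOF pipeline, assuming numerical numpy arrays, is provided below; whether LOF is applied to the entire resampled set or only to the (resampled) minority class is an implementation choice made here, not necessarily the original protocol.

```python
# Minimal sketch of SMOTE followed by LOF-based cleaning: oversample with
# SMOTE, then drop minority samples flagged as outliers by LocalOutlierFactor.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import LocalOutlierFactor

def smote_lof(X, y, minority_label, k=5):
    X_res, y_res = SMOTE(k_neighbors=k, random_state=0).fit_resample(X, y)

    min_mask = y_res == minority_label
    lof = LocalOutlierFactor(n_neighbors=k)
    # fit_predict returns -1 for outliers (high LOF score) and 1 for inliers.
    inlier = lof.fit_predict(X_res[min_mask]) == 1

    keep = ~min_mask                                  # keep all majority samples...
    keep[np.where(min_mask)[0][inlier]] = True        # ...plus inlier minority samples
    return X_res[keep], y_res[keep]
```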
The present approach is designed to mitigate the drawbacks observed in the ENN and IPF cleaning methods. These conventional techniques tend to eliminate minority samples located in regions with low density or class overlap, which in turn hinders the classifier's overall generalization performance. As such, the authors proposed a modification of the heterogeneous value difference metric (HVDM) used in ENN that takes into account the local imbalance and spatial sparsity, as explained in the following formulation: \({d}_{W}^{IR,m}\left({x}_{1},{x}_{2}\right)=\left\{\begin{array}{ll}{e}^{{\left({IR}^{+}\right)}^{m}}\times {d}_{HVDM}\left({x}_{1},{x}_{2}\right), & \text{if } {x}_{2} \text{ is positive}\\ {e}^{{\left({IR}^{-}\right)}^{m}}\times {d}_{HVDM}\left({x}_{1},{x}_{2}\right), & \text{if } {x}_{2} \text{ is negative}\end{array}\right.\), where \({IR}^{-}\) and \({IR}^{+}\) correspond to the percentage of majority and minority samples in the dataset, respectively, and \(m\) represents the dimensionality of the dataset. If we have a classification problem with a very high imbalance rate \(({IR}^{-}\to 1)\), it is natural that the instances of the minority class are more dispersed and therefore more distant from each other. Thus, the introduction of the aforementioned factor moves the majority samples farther away by a factor of approximately \(e\) and keeps the minority samples in almost identical positions, thus promoting their local density. In practice, this method foresees the application of SMOTE followed by the modified version of ENN, designated WENN
Tests were conducted on 10 artificial databases and 10 real databases, where the performance of SMOTE-WENN was evaluated in relation to SMOTE-IPF [37], SMOTE-ENN [58] and NB-Comm [59]. It was observed that, by taking a cautious approach to eliminate minority samples in regions prone to class overlap, SMOTE-WENN successfully bolstered the generalization performance of the models. Notably, when analysing the results obtained from the real-world databases, SMOTE-WENN showcased a significant improvement (AUC, G-mean) over SMOTE-ENN and SMOTE-IPF, as statistically confirmed by a Wilcoxon signed rank test with a 95% confidence interval
NaNSMOTE aims to address the dependency of SMOTE on the parameter \(k\), which defines the number of nearest neighbours used for the generation of synthetic samples. Hence, the authors propose the use of Natural Neighbours (NaN), where a sample \({x}_{i}\) is a NaN of \({x}_{j}\) if \({x}_{i}\) belongs to the nearest neighbours of \({x}_{j}\) and \({x}_{j}\) belongs to the nearest neighbours of \({x}_{i}\). An advantageous aspect of the proposed approach is the adaptive value of λ assigned to each sample, which reflects its NaN search count (refer to [60] for an in-depth explanation) and dynamically adapts to the complexity of the dataset. Based on this definition, outliers are characterised by not having NaNs. In practice, NaNSMOTE entails the determination of NaNs followed by the removal of outliers. Consequently, for each sample of the minority class, the respective NaNs are selected, and new samples are generated. While SMOTE multiplies the difference between samples by a random value ranging from 0 to 1, NaNSMOTE employs a different strategy. In NaNSMOTE, each attribute is multiplied by a distinct random value, introducing greater diversity and reducing potential biases during the synthetic sample generation process
At first, NaNSMOTE was evaluated on 20 datasets from the UCI repository. In this experimental context, NaNSMOTE consistently guaranteed the lowest average rank when compared with SMOTE [23]. Secondly, a comparative analysis between other oversampling methods reported in the literature and their combination with NaNSMOTE was performed. In this context, Borderline-NaNSMOTE, Safe-Level-NaNSMOTE, NaNADASYN, NaNSMOTE-ENN, NaNSMOTE-IPF, and NaNSMOTE \(k\)-means were tested. It was found that the utilization of NaNs for neighbourhood determination yielded increased performance. To establish the statistical significance of these results, a nonparametric two-sided Wilcoxon signed ranks test was conducted, with a significance level of 0.1
This method firstly applies \(k\)-means clustering to the majority class. After determining the centre of each cluster, the \(k\)-nearest neighbours of the minority class are determined, which are used to perform oversampling based on Random-SMOTE. Afterwards, an algorithm is applied to remove samples that generate class overlapping. This algorithm firstly identifies the minority class and calculates its centre. It then iterates through each sample in the minority class and checks whether all its k nearest neighbours belong to the majority class. If so, the sample is added to a set of overlapping samples. Then, for each overlapping sample, the algorithm computes the distance between the sample and the centre of the minority class, as well as the distance between each nearest neighbour of the sample and the centre. If the distance between the sample and the centre is smaller than any of the distances of the corresponding neighbours, the sample is removed from the set. Finally, the algorithm deletes all samples in the set of overlapping samples. This process is also repeated for the majority class
Tests were conducted on 10 imbalanced datasets from the UCI and KEEL repositories. Using the SVM classifier, the authors verified that the proposed method guaranteed the highest G-mean and F1-score when compared to SMOTE [23] and Random-SMOTE [62]
Initially, the method calculates the Euclidean distance between the points of the minority class and the geometric centre of the minority class. By identifying the most distant points, regarded as noise, these samples are eliminated. Subsequently, an adapted version of SMOTE is employed, where new samples, \({x}_{j}{\prime}\), are generated within an n-dimensional rectangle defined by the sample under analysis, \({x}_{j}\), and a randomly selected nearest neighbour, \({y}_{j}\). Finally, synthetic samples are discarded if they do not satisfy two conditions: (1) \(\Vert {x}_{j}{\prime}-{x}_{j}\Vert >\Vert {y}_{j}-{y}_{j}\Vert \) and (2) the nearest neighbour of the synthetic sample must belong to the minority class
The authors implemented CR-SMOTE as a resampling method to classify the severity of software bugs. Three open-source datasets were used: Eclipse, Mozilla and GNOME. CR-SMOTE registered the highest F1-score compared to the application of SMOTE [23], RUS, ROS [24] and a cost-sensitive method
This method envisages mitigating the presence of overlap and noisy samples after SMOTE through the weighted and joint application of ENN and Tomek-Links. According to the authors, the working principle of ENN makes it more capable of eliminating the presence of noise, while Tomek-Link mainly acts upon class overlap mitigation. To quantify the levels of overlap and noise, the authors propose utilizing the error of an LDA classifier and the k-Disagreeing Neighbours measure. These estimates are then used to compute a weighted percentage, guiding the elimination of samples through either ENN or Tomek-Links
The proposed approach was tested on 10 open-source datasets from the UCI repository, where a comparative analysis between the proposed method and SMOTE, Condensed Nearest Neighbours [65], Easy Ensemble [62], SMOTE-ENN [58], SMOTE-TL [45] and SMOTE-IPF [37] was performed. In 4 of the 10 datasets, there were significant differences (Tukey's range test) between the proposed method and the other approaches. As for the remaining datasets, although the differences are not significant, the proposed methodology generated identical or slightly higher F1-scores
Firstly, the t-SNE dimensionality reduction technique is applied to reduce the dimensionality of the data to a two-dimensional space. This is done to mitigate errors that may arise from computing k-nearest neighbours based on the Euclidean distance in high-dimensional spaces, as previously suggested by [46, 52]. Following that, the \(k\)-nearest neighbours of each instance of the minority class are determined and what the authors call shadowsamples are created, which are obtained by adding noise from a normal distribution, \(N(0, h({\sigma }_{f}))\), for all features \(f\in F\), where \(h({\sigma }_{f})\) is some function of the sample variance \({\sigma }_{f}\) for the feature \(f\). Finally, synthetic samples are generated through a random affine combination with positive coefficients (convex combination) of the chosen shadowsamples. Note that a vector \({\varvec{v}}= {\alpha }_{1}{\mu }_{1}+\dots + {\alpha }_{m}{\mu }_{m}\) is considered a random affine combination of vectors \({\mu }_{1},\dots , {\mu }_{m}\) if \({\alpha }_{1}+\dots +{\alpha }_{m}=1\) and \({\alpha }_{1},\dots ,{\alpha }_{m}\) are randomly selected from a Dirichlet distribution
To assess the developed methodology, 11 scikit-learn imbalanced benchmark datasets were selected which held a high imbalance ratio (higher than 25) and high dimensionality (more than 100 features). It was verified that the balanced Accuracy and average F1-score were higher compared to SMOTE [23], SVM-SMOTE [67], Borderline-SMOTE [28] and ADASYN [30]
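The generation step can be sketched as follows; the function \(h\) is assumed here to be a constant fraction of the per-feature standard deviation, and the t-SNE projection and neighbour selection steps are omitted, so this only illustrates the shadowsample perturbation and the Dirichlet-based convex combination.

```python
# Sketch of the shadowsample generation described above: perturb a few minority
# samples with Gaussian noise scaled by a function of the feature variance,
# then combine them through a random convex combination whose coefficients are
# drawn from a Dirichlet distribution (they are positive and sum to 1).
import numpy as np

def convex_combination_oversample(X_min, m=3, noise_scale=0.1, rng=None):
    rng = np.random.default_rng(rng)
    sigma = X_min.std(axis=0)

    # Shadowsamples: pick m minority samples and perturb them with N(0, h(sigma_f)).
    idx = rng.choice(len(X_min), size=m, replace=False)
    shadows = X_min[idx] + rng.normal(0.0, noise_scale * sigma, size=(m, X_min.shape[1]))

    # Convex combination of the shadowsamples.
    alphas = rng.dirichlet(np.ones(m))
    return alphas @ shadows
```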
The table provides a detailed overview of the taxonomy, methodology, and main conclusions of each study. Note that the mathematical notation used in the description of each method corresponds to that utilized in the original articles, simplifying cross-referencing for further consultation, if necessary
Table 4
Comprehensive summary of recently reported works in the literature focusing on undersampling methods
Method that aims to maximize the visibility of the minority class in the presence of overlapping. In this paper, the authors propose 4 neighbourhood-based undersampling methods, namely, Neighbourhood Search (NB-Basic), Modified Tomek Link Search (NB-Tomek), Common Nearest Neighbours Search (NB-Comm), and Recursive Search (NB-Rec). The first method envisages eliminating all instances of the majority class that hold at least one nearest neighbour of the minority class. The second method arises due to the tendency of NB-Basic to excessively eliminate majority samples and, consequently, increase the loss of relevant information for model training. To mitigate this, NB-Tomek employs a neighbourhood analysis similar to NB-Basic, but selectively removes points only if the previously selected majority samples form a Tomek-Link with the minority sample. As for the third method, Common Nearest Neighbours Search, a matrix is created that records the frequency with which a sample from the majority class is included in the nearest neighbours of samples from the minority class. Then, if a majority sample has a frequency greater than 2, it is eliminated. Finally, NB-Rec represents an extension of the NB-Comm algorithm and incorporates a two-step process. Initially, NB-Comm is applied to eliminate samples as described above, which are saved in a set \(X\). Subsequently, NB-Comm is reapplied to \(X\). In this manner, majority samples that are nearest neighbours of at least two samples belonging to \(X\) are also eliminated
Three types of tests were performed: (1) on artificial datasets with different imbalance ratios and degrees of overlap, (2) on 24 datasets from the KEEL repository and (3) on the breast cancer dataset from KDD Cup 2008 and the handwritten digit dataset from the MNIST database. In test (1), NB-Rec showed the highest sensitivity (99.95%) and a competitive G-mean; however, it showed a low F1-score and precision compared to SMOTE, suggesting an increase in false positives. NB-Comm showed slightly better overall results than NB-Basic and NB-Tomek. In test (2), NB-Comm had the highest F1-score and G-mean, significantly outperforming other undersampling methods such as ENN [77], kmUnder [81] and OBU [82], as well as SMOTE [23] and BorderlineSMOTE [28]. In test (3), NB-Rec produced the highest sensitivity, yet experienced a substantial decline in accuracy
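For illustration, the NB-Basic and NB-Comm rules can be sketched as follows; the neighbourhood size and the NB-Comm frequency threshold are parameters whose values here are only indicative of the description above.

```python
# Sketch of two of the neighbourhood-based undersampling rules described above
# (NB-Basic and NB-Comm), assuming numerical features and a binary target.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nb_basic(X, y, minority_label, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    idx = idx[:, 1:]  # drop each sample itself
    # Remove majority samples having at least one minority sample among their neighbours.
    has_min_neighbour = (y[idx] == minority_label).any(axis=1)
    drop = (y != minority_label) & has_min_neighbour
    return X[~drop], y[~drop]

def nb_comm(X, y, minority_label, k=5, freq_threshold=2):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X[y == minority_label])
    idx = idx[:, 1:]  # the first neighbour of each minority query is the sample itself
    # Count how often each sample appears among the neighbours of minority samples.
    counts = np.bincount(idx.ravel(), minlength=len(X))
    drop = (y != minority_label) & (counts > freq_threshold)
    return X[~drop], y[~drop]
```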
Resorting to the idea of mutual class potential to perform informative undersampling. The mutual class potential corresponds to a real-valued function reflecting the degree of affiliation of a sample with respect to the minority and/or majority class. To estimate this mutual class potential, the authors proposed the use of Gaussian radial basis functions (RBF), which enable the calculation of the potential, \(\phi \), for each sample, \(x\), based on the following formulation: \(\phi \left(x,K,k,y\right)= \sum_{i=1}^{\left|K\right|}{e}^{-{\left(\frac{{\Vert {K}_{i}-x\Vert }_{2}}{y}\right)}^{2}}- \sum_{i=1}^{\left|k\right|}{e}^{-{\left(\frac{{\Vert {k}_{i}-x\Vert }_{2}}{y}\right)}^{2}}\), where \(K\) and \(k\) correspond to the majority and minority class elements, respectively, and \(y\) is a parameter that affects the spread of the RBF. Based on this potential, it is perceptible that higher values indicate a high concentration of samples from the majority class and vice-versa. Assuming that samples with high values of mutual class potential will be somewhat redundant and uninformative, due to their similarity to other samples, these are the best suited to be eliminated. In practical terms, this algorithm provides the calculation of \(\phi \) for all samples followed by the elimination of the sample with the highest \(\phi \). This process is repeated iteratively until the dataset is balanced
Fifty datasets from the KEEL repository (with at least 12 minority data points and an AUC of less than 0.85 when applying an SVM without resampling) were used. 11 other undersampling approaches, 4 oversampling approaches, and 2 hybrid methods were used to comparatively evaluate the capabilities of RBU. RBU recorded a higher F1-score, AUC, Accuracy and G-mean when using the NB classifier. By analysing the dataset characteristics (number of safe, borderline, rare and noisy samples as proposed by Napierala [15]), it was evident that RBU performed better on difficult datasets, i.e., with high proportions of rare and noisy samples
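A minimal sketch of the radial-based undersampling loop is shown below; it restricts candidate removals to majority samples and recomputes the potential at every iteration, which is a simple but inefficient reading of the procedure described above.

```python
# Sketch of radial-based undersampling: compute the mutual class potential of
# every majority sample with a Gaussian RBF and iteratively delete the majority
# sample with the highest potential until the classes are balanced.
import numpy as np

def rbf_potential(x, X_maj, X_min, gamma=1.0):
    maj_term = np.exp(-(np.linalg.norm(X_maj - x, axis=1) / gamma) ** 2).sum()
    min_term = np.exp(-(np.linalg.norm(X_min - x, axis=1) / gamma) ** 2).sum()
    return maj_term - min_term

def radial_based_undersampling(X, y, minority_label, gamma=1.0):
    X_min = X[y == minority_label]
    majority_label = y[y != minority_label][0]
    X_maj = list(X[y != minority_label])

    while len(X_maj) > len(X_min):
        maj_arr = np.asarray(X_maj)
        potentials = [rbf_potential(x, maj_arr, X_min, gamma) for x in X_maj]
        X_maj.pop(int(np.argmax(potentials)))  # drop the most redundant majority sample

    X_bal = np.vstack([X_min, np.asarray(X_maj)])
    y_bal = np.concatenate([np.full(len(X_min), minority_label),
                            np.full(len(X_maj), majority_label)])
    return X_bal, y_bal
```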
Homogeneous Consensus Clustering-Based Undersampling and Heterogeneous Consensus Clustering-Based Undersampling [22]
Clustering based
Methodology based on the use of clustering to identify instances of the majority class that should be retained during the undersampling process. Recognizing the challenges associated with defining the appropriate clustering algorithm and its parameterization, the authors propose two frameworks: Homogeneous Consensus Clustering-Based Undersampling and Heterogeneous Consensus Clustering-Based Undersampling. In the first method, the number of clusters is first defined as the number of samples in the minority class. Then, the same clustering algorithm is applied with different parameterizations (the number of clusters remains constant), and the obtained partitions are merged by applying a consensus function (simple voting function, incremental voting function or label correspondence search algorithm), generating a final partition. Finally, the centres of the clusters in the final partition will correspond to the new majority class. The second algorithm differs by considering 5 clustering algorithms (K-means, K-modes, K-means++, self-organizing maps and the DIANA algorithm) instead of the same algorithm with different parameterizations
Forty-four datasets from the KEEL repository and 2 large-scale datasets, namely the protein homology prediction and Twitter sentiment datasets, were used. 7 state-of-the-art methods were used to benchmark the performance of the proposed approaches, namely UnderBagging4 (UB4) [84], UnderBagging24 (UB24) [84], RusBoost1 (Rus1) [84], SMOTEBagging4 (SBAG4) [84], UnderBagging1 (UB1) [84], cluster-based undersampling based on cluster centres (Centres) [81], and Cluster Centres (CC) [81]. Undersampling based on heterogeneous consensus clustering produced the highest average AUC when using the consensus function based on label correspondence search
The proposed methodology focuses on the utilization of Gaussian Mixture Models (GMMs) to model the majority class and enable efficient undersampling. In practical terms, this method first iteratively computes the Bayesian information criterion (BIC) for GMMs with different numbers of components and types of covariance matrices. Based on this criterion, which seeks the best balance between model complexity and the ability to describe the data distribution, the optimal parameterization is determined. Then, the GMM is used to identify the sample of the minority class with the highest log-likelihood, designated the "cross-edge", which is expected to be located at the boundary between classes. Finally, majority samples are eliminated, in identical quantities, above and below the cross-edge
The tests were performed on 16 open-source datasets (imbalance ratio between 1.84 and 129.4). To comparatively evaluate the proposed method, RUS, Cluster Centroids [81], TL [76] and ENN [77] were used. C4.5 and a C4.5-based bagging ensemble were the tested classifiers. GMUS generated the highest AUC in 13 datasets and the highest F1-score in 12 out of 16 datasets. Additionally, tests were conducted on a credit card transaction dataset, where GMUS guaranteed the highest AUC over all other methods
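The BIC-driven model selection step can be sketched as follows; the component grid is an illustrative choice, and the subsequent cross-edge-based elimination is not reproduced.

```python
# Sketch of the BIC-based model selection step of the GMM undersampling method
# described above: fit candidate Gaussian Mixture Models to the majority class
# over a grid of component counts and covariance structures, keeping the model
# with the lowest BIC (best fit/complexity trade-off).
from sklearn.mixture import GaussianMixture

def select_majority_gmm(X_maj, max_components=5):
    best_gmm, best_bic = None, float("inf")
    for n in range(1, max_components + 1):
        for cov in ("full", "tied", "diag", "spherical"):
            gmm = GaussianMixture(n_components=n, covariance_type=cov,
                                  random_state=0).fit(X_maj)
            bic = gmm.bic(X_maj)
            if bic < best_bic:
                best_gmm, best_bic = gmm, bic
    return best_gmm
```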
Method that uses Bayes' theorem and the concept of sensitivity, which in this case has a different meaning than the widely adopted performance metric for imbalanced domains, to determine the samples more susceptible to class imbalance and, consequently, more relevant to the classification problem. Sensitivity corresponds to the difference between the posterior probability computed under the real classification context (where there is class imbalance and, as such, the prior probability of the minority class is much lower than the prior probability of the majority class) and the posterior probability computed under identical prior probabilities. In practical terms, samples near the boundary between classes will have a higher sensitivity and vice-versa. To implement this algorithm, the authors suggest computing the likelihood function based on a k-NN approach [87] and the prior probability through the class ratio. Hence, it is possible to determine the sensitivity, \(S\left({x}_{j},{y}_{j}\right)\), for each sample. Finally, the sampling probability, \(W\left({x}_{j},{y}_{j}\right)\), is determined based on the following formulation: \(W\left({x}_{j},{y}_{j}\right)=\frac{S\left({x}_{j},{y}_{j}\right)+1}{\sum_{j=1}^{N\_}S\left({x}_{j},{y}_{j}\right)+N\_}\), where \(N\_\) is the number of negative examples. Based on the sampling probability, samples are eliminated until a balanced dataset is obtained
Tests were conducted on 20 datasets from the KEEL repository. Additionally, 5 resampling methods were used to comparatively evaluate the performance of the proposed method, namely ROS, SMOTE [23], RUS, IHT [79] and CC [73]. The CART classifier was used. Compared to the other methods, USS outperforms NONE (no resampling), ROS, SMOTE, RUS, IHT and CC in 19, 19, 16, 20, 15 and 15 out of 20 datasets, respectively. Additionally, by performing a Wilcoxon signed-ranks test with a significance level of 0.05, it was possible to verify that USS significantly outperforms the other resampling approaches
The proposed algorithm, called AdaOBU, is a selective undersampling process based on the Fuzzy C-means clustering algorithm designed to address the challenge of class overlap in imbalanced domains. Fuzzy C-means is a soft clustering technique that allows each sample to belong to multiple clusters, with membership degrees indicating the likelihood of belonging to each cluster. Within the AdaOBU algorithm, an initial step involves the application of Fuzzy C-means with a fixed number of two clusters. This step aims to obtain membership degrees for each sample, reflecting the extent of association with each cluster. Subsequently, the algorithm calculates the average membership degrees for the minority class, \(\overline{{\mu }_{neg}},\) and the majority class, \(\overline{{\mu }_{pos}}\). To determine the threshold for selective undersampling, denoted as \({u}_{th}\), the algorithm identifies the minimum value between \(\overline{{\mu }_{neg}}\) and \(\overline{{\mu }_{pos}}\). This threshold serves as a criterion for selecting the samples from the majority class that exhibit higher membership degrees than \({u}_{th}\), while eliminating the remaining samples. It is important to note that this method incorporates an adaptive elimination strategy, where the number of samples removed is determined by the fuzziness of the dataset, which directly relates to the extent of class overlap
Firstly, experiments were conducted on 66 artificial datasets with different degrees of imbalance and overlap. AdaOBU registered sensitivity and G-mean values competitive with SMOTE, KmUnder and BLSMOTE at higher imbalance degrees; however, it generated lower specificity, thus suggesting a high occurrence of false positives, which most likely resulted from excessive removal of negative samples. Next, 66 datasets from the KEEL and UCI repositories were tested. The SMOTE [23], Borderline-SMOTE [28], KmUnder [81], SMOTE-ENN [58], SMOTEBagging [88] and RUSBOOST [89] methods were used for comparison. Again, AdaOBU was found to have the highest sensitivity, but the lowest specificity. Finally, larger datasets were used, namely the breast cancer dataset from KDD Cup 2008 and the handwritten digits dataset from the MNIST database. Again, the same trend as in the previously conducted experiments was observed
This undersampling method envisages two key steps for addressing the presence of overlapping instances in imbalanced datasets. The first step involves the detection of the overlapping region, while the second step focuses on the cleaning of overlapping instances and undersampling the majority class. To identify the overlapping instances, the authors suggest using One-class SVMs to detect overlapping instances from both classes, meaning this method is applied individually for each class. In the second phase, the algorithm identifies all Tomek-Links present in the dataset. Subsequently, all Tomek-Links belonging to the previously determined overlapping instances, referred to as Abnormal Tomek-Links, are eliminated. As for the remaining Tomek-Links, 3 scenarios may occur: (1) if the minority sample of the Tomek-Link belongs to the overlapping region and the corresponding majority sample does not, the minority sample will be kept; (2) if the majority sample of the Tomek-Link is in the overlapping region and the minority sample is not (the inverse of case (1)), a neighbourhood analysis is performed with the 5 nearest neighbours of the majority instance. Should the number of neighbours of its own class be less than the number of neighbours of the minority class, the majority sample will be eliminated. (3) When both samples are not in the overlapping region, they are considered Normal Tomek-Links (NTL). In this context, a redundancy analysis is conducted to eliminate the least significant samples, as discussed below. Firstly, the nearest neighbour of each NTL is determined (majority samples of the NTL are represented as \({x}_{d}^{{\prime}{\prime}}\)), which is called the redundant pair, \({x}_{e}\). Then the midpoint (centroid) of the majority class is determined. Finally, the \({x}_{d}^{{\prime}{\prime}}\) are eliminated if their distance from the centroid is greater than the distance of their \({x}_{e}\) from the centroid
The algorithm was tested on 9 datasets from the UCI repository (7 binary and 2 multiclass). In a first instance, a comparative analysis was conducted with 6 resampling approaches, namely NCL [77], NCL + BNF [91], OSS + BNF [91], OSS + TL [91], SMOTE + TL [45] and SMOTE + BNF [91]. The proposed scheme outperforms state-of-the-art techniques by achieving higher F1-score and specificity values. This improvement effectively reduces false positives while avoiding an increase in false negatives. Finally, another analysis was conducted versus cost-sensitive learning (cost-sensitive feed-forward neural network [92], cost-sensitive SVM [93], and cost-sensitive Naïve Bayes [93]) and extreme learning machine approaches. The proposed resampling technique performed significantly well compared to the CSL and ELM techniques in most cases across the considered performance metrics
The Progressive Undersampling Method with Density (PUMD) aims to determine the minimum number of representative majority-class samples required to achieve optimal model performance. This approach is particularly relevant in big data contexts, as it contrasts with other data cleaning strategies that typically focus on removing only noisy or borderline instances. To identify the most significant samples, the authors propose ranking samples based on two factors: density (\(\rho \)), defined as \({\rho }_{i}=\sum_{j\in {I}_{S}\setminus \{i\}}{e}^{-{\left(\frac{{d}_{ij}}{{d}_{c}}\right)}^{2}}\), where \({d}_{ij}\) is the distance between majority samples \(i\) and \(j\), and \({d}_{c}\) is a cut-off distance; and the relative distance to other potential cluster centres, expressed as \({\delta }_{i}=\underset{j:{\rho }_{i}<{\rho }_{j}}{\text{min}}\left({d}_{ij}\right)\). Using these factors, the importance of each sample is quantified through the function \(f\left(\rho ,\delta \right)={\rho }^{2}\times \delta \), which attains high values when a sample resides at a cluster centre. The algorithm operates iteratively, selecting groups of samples with the highest \(f\left(\rho ,\delta \right)\) values and evaluating the classifier's performance on the resulting dataset. The process terminates when further performance gains are no longer observed
The algorithm was evaluated on 40 datasets using tenfold cross-validation and compared against six other undersampling methods: RUS, OSS [75], NCL [77], CBU, CNN [65], and ENN. The assessment focused on metrics such as AUC, precision, recall, and F1-score. Experimental results demonstrated that PUMD achieved the highest recall in 36 out of 40 datasets, the best AUC in 37 out of 40 datasets, and the highest F1-score in 36 out of 40 datasets
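A sketch of the two ranking factors used by PUMD is given below; the fallback value assigned to the global density peak (its maximum pairwise distance) follows the usual density-peaks convention and is an assumption rather than a detail taken from the original paper.

```python
# Sketch of the PUMD ranking factors computed for the majority class: the
# Gaussian-kernel density rho_i and the distance delta_i to the nearest sample
# of higher density; samples are then ranked by f(rho, delta) = rho^2 * delta.
import numpy as np
from scipy.spatial.distance import cdist

def pumd_importance(X_maj, d_c):
    D = cdist(X_maj, X_maj)
    np.fill_diagonal(D, np.inf)                    # exclude each sample from its own sums
    rho = np.exp(-(D / d_c) ** 2).sum(axis=1)      # local density

    delta = np.empty(len(X_maj))
    for i in range(len(X_maj)):
        higher = rho > rho[i]
        if higher.any():
            delta[i] = D[i, higher].min()          # distance to the closest denser point
        else:
            # Global density peak: assign its maximum finite pairwise distance.
            delta[i] = np.max(D[i], initial=0.0, where=np.isfinite(D[i]))
    return rho ** 2 * delta
```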
The table provides a detailed overview of the taxonomy, methodology, and main conclusions of each study. Note that the mathematical notation used in the description of each method corresponds to that utilized in the original articles, simplifying cross-referencing for further consultation, if necessary
Table 5
Comprehensive summary of recently reported works in the literature focusing on hybrid resampling approaches
The proposed algorithm entails the alternate application of oversampling, via M-SMOTE, and undersampling, through ENN [77], until a stopping criterion, \({G}_{mcc}\), is reached. M-SMOTE corresponds to a variant of SMOTE characterized by the adoption of an oversampling rate that oscillates according to the misclassification rate of a Random Forest (RF). More specifically, the oversampling rate, \({M}_{class\_rate}\), is given by the quotient between the number of misclassified instances of the majority class, \({M}_{miss\_maj}\), and the number of misclassified instances of the minority class, \({M}_{miss\_min}\), as depicted in the following formulation: \({M}_{class\_rate }= \frac{{M}_{miss\_maj}}{{M}_{miss\_min}}\times 100\%\). To facilitate the implementation of the hybrid approach, the algorithm incorporates several control variables. These include \({E}_{min}\), which stores the Matthews Correlation Coefficient (MCC) value obtained from the RF classifier after applying M-SMOTE oversampling. \({E}_{maj}\) stores the MCC value achieved through the RF classifier after performing undersampling using ENN. \(Eva\) represents the highest MCC value achieved during the iterative process, whether through oversampling or undersampling. Additionally, \({G}_{mcc}\) is initialized with a value of 0, and decrementing this variable prompts the stopping of the algorithm. After defining the initial variables, the algorithm computes \({M}_{class\_rate }\), performs oversampling with M-SMOTE, and then determines \({E}_{min}\) on the newly obtained dataset. If \({E}_{min}>Eva\), the algorithm proceeds to the undersampling step and \(Eva\) assumes the value of \({E}_{min}\); otherwise \({G}_{mcc}= {G}_{mcc}-1\), leading to the termination of the algorithm. After applying ENN, \({E}_{maj}\) is computed and, if \({E}_{maj}>Eva\), the oversampling process starts again and \(Eva\) takes on the value of \({E}_{maj}\); otherwise, \({G}_{mcc}= {G}_{mcc}-1\) and the algorithm terminates
Ten UCI datasets were used to test the proposed method. Several other resampling methods were used to comparatively evaluate the hybrid RFMSE method, namely SMOTE [23], Combined Cleaning and Resampling (CCR) [97], Gaussian-SMOTE [98], k-means SMOTE [3], IHT [79], Radial-based under-sampling (RBU) [83], Meta learning (ML) [95], and finally two hybrid sampling algorithms, SMOTE-ENN [58] and CCR + IHT [79, 97]. Using RF as a classifier, RFMSE obtained the highest F1-score on 9 out of 10 datasets, thus illustrating its robustness. In addition, the authors found that the use of RF as the base classifier outperformed the use of bagging and boosting combined with RF. Complementarily, RFMSE produced the highest F1-score on the failed abortion dataset
The proposed method aims to address the overfitting tendencies of oversampling methods and the underfitting tendencies of undersampling methods through the development of a hybrid methodology. To perform oversampling, the authors first propose calculating the local density of each minority instance, enabling the adaptive generation of samples proportional to this value. Thus, the authors initially apply k-means clustering with the number of clusters equal to the square root of the number of minority samples. Additionally, the initial cluster positions are determined based on the methodology described in [100]. Subsequently, the local density of each sample is defined as the number of nearest neighbours belonging to the same cluster and is then converted into a probability distribution using the roulette wheel selection operator. Finally, the SMOTE procedure is applied based on the previously computed probabilities until an imbalance ratio of 2 is achieved. The undersampling procedure follows the same methodology for estimating the local density, this time for the majority class samples, enabling their conversion into probabilities using the roulette wheel selection operator. This allows for preferential undersampling of the most redundant samples. The undersampling process continues until an IR of 1 is reached
To test the performance of CDBH, 44 datasets from the UCI repository are used, and 7 resampling methods are used as controls, namely SMOTE [23], SMOTE-TL [45], Safe-level SMOTE [31], SMOTE with Rough Set Theory (SMOTE-RSB) [101], Evolutionary Under-Sampling with the CHC algorithm (EUSCHC) [102], AHC [103] and FRB + CHC [104]. CDBH registered a statistically significantly (Friedman test) higher F1-score than all the other resampling methods when using the SVM classifier. Additionally, the authors utilized the number of support vectors as an indicator of the computational complexity of the classification model, where CDBH showed the lowest values, thus indicating that simplified yet representative datasets were obtained
This method introduces a novel approach that combines SMOTE and Reverse k-Nearest Neighbours (RkNN) to address the challenge of noise detection in imbalanced datasets. Unlike other methods, this approach considers a more global and comprehensive analysis of sample density, enabling the identification of noise through the application of RkNN after oversampling. In RkNN, the number of nearest neighbours of a sample, \({x}_{i}\), corresponds to all the other points that include \({x}_{i}\) in their neighbourhood. In practical terms, this method provides, in a first step, the application of SMOTE with \(k=5\), until a balanced dataset is obtained. Then, RkNN is applied separately with \(k=\sqrt{{n}_{samples}}\) for the majority and minority class, making it possible to obtain the number of reverse nearest neighbours for each instance, \({r}_{i}\). Consequently, the transformation \({p}_{i}= \frac{{r}_{i}}{\sum_{i}^{n}{r}_{i }}\) is applied to obtain an estimate of the probability of belonging to each class. Finally, and grounded on the principle that noise is characterized by having a high probability of belonging to the opposite class, samples are eliminated based on the following formulation: \(\left\{\begin{array}{ll}{x}_{i}\in Normal, & \text{if } {p}_{i}^{\prime}<\lambda {p}_{i}\\ {x}_{i}\in Noise, & \text{if } {p}_{i}^{\prime}>\lambda {p}_{i}\end{array}\right.\), where \({p}_{i}\) and \({p}_{i}^{\prime}\) denote the probability estimates of \({x}_{i}\) belonging to its own class and to the opposite class, respectively, and \(\lambda \) controls the degree of aggressiveness in selecting noisy samples
11 datasets collected from the UCI machine learning repository, 29 datasets acquired from the KEEL repository, two real-world bioinformatics datasets and four real Kaggle datasets were used in the experimental protocol. This method was compared with other hybrid resampling approaches, namely SMOTE-TL [45], SMOTE-ENN [58], SMOTE-RSB [101] and SMOTE-IPF [37], and 3 classifiers were used: GNB, CART and LDA. SMOTE-RkNN obtained the best average in terms of G-mean across all classifiers, thus showing its robustness. Additionally, this method was found to be particularly capable on more complex data distributions
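The reverse-neighbour counting step can be sketched as follows, under the assumption that, for a given class, \(r_i\) counts how many members of that class include instance \(i\) among their \(k\) nearest neighbours; the subsequent \(\lambda\)-based filtering then compares the resulting per-class estimates.

```python
# Sketch of the reverse k-nearest-neighbour counting used by SMOTE-RkNN-style
# noise detection: count, for each instance of the resampled dataset, how many
# samples of a given class include it among their k nearest neighbours, and
# normalize the counts into the p_i estimates used by the noise filtering rule.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def rknn_class_probabilities(X_all, X_class, k=None):
    # k defaults to the square root of the number of samples, as described above.
    k = k or max(1, int(np.sqrt(len(X_all))))
    nn = NearestNeighbors(n_neighbors=k).fit(X_all)
    _, idx = nn.kneighbors(X_class)              # neighbours of each class member
    r = np.bincount(idx.ravel(), minlength=len(X_all))
    return r / r.sum()                           # p_i estimate for this class

# Usage (hypothetical labels): p_min = rknn_class_probabilities(X_res, X_res[y_res == 1])
#                              p_maj = rknn_class_probabilities(X_res, X_res[y_res == 0])
```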
This study proposes a novel approach that utilizes the Particle Swarm Optimization (PSO) algorithm to address imbalanced datasets. In the PSO algorithm, a group of individuals referred to as particles move iteratively in a search space, with the objective of optimizing an objective function. This iterative process involves computing the objective function and adjusting the speed of the particles, ultimately leading to convergence of the algorithm. In this context, the PSO algorithm is applied to the majority samples, where the objective function is defined as the AUC value obtained via tenfold cross-validation. This metric is computed based on the new majority samples generated in each iteration, along with the minority samples. By optimizing the objective function, the PSO algorithm facilitates the selection of the most relevant majority samples. After the PSO-based optimization process, an oversampling method such as SMOTE, Borderline-SMOTE, ADASYN, or MWSMOTE is applied. To address the overfitting issue associated with oversampling methods, the PSO algorithm is again employed. This time, it aims to select a subset of optimal minority samples based on the same objective function used in the previous step
Ten datasets were chosen from the UCI repository to study the performance of the selected methodology. The J48 decision tree was used as a classifier. The \(SS{O}_{maj}-SMOTE-SS{O}_{min}\), \(SS{O}_{maj}-BorderlineSMOTE-SS{O}_{min}\), \(SS{O}_{maj}-ADASYN-SS{O}_{min}\) and \(SS{O}_{maj}-MWSMOTE-SS{O}_{min}\) variants were compared with the baseline oversampling methods. The proposed framework generated significant AUC improvements when compared to the baseline oversampling approaches, especially on highly imbalanced and large datasets
In the Ant-Colony Optimization Resampling (ACOR) method, the ant-colony optimization algorithm was employed to select a suboptimal subset of samples which maximized classifier performance, based on the G-mean as the objective function [107]. Practically, this method firstly entails the application of a predefined oversampling method followed by the ant colony optimization procedure
By comparing resampling methods versus their jointly application with ACOR in 18 UCI datasets, it was possible to verify that ACOR-SMOTE, ACOR-ADASYN and ACOR-BSO statistically outperformed the baseline resampling methods in terms of AUC, thereby illustrating the methodology’s robustness [107]
The Hybrid Clustered Affinitive Borderline SMOTE (HCAB-SMOTE) method is designed to overcome the drawbacks of Borderline SMOTE, specifically the lack of algorithmic assurance that oversampled data points fall near the decision boundary, despite selecting samples from those regions. HCAB-SMOTE also addresses the risk of overfitting by incorporating undersampling into the process. Functionally, the algorithm identifies danger zones for both minority and majority classes using the Nearest Neighbours approach, akin to Borderline SMOTE, but eliminates samples from the majority class found in these regions. Following this, k-means clustering is applied to the minority class samples within the danger region, and the sample count for each cluster is calculated. Clusters containing fewer than half the samples of the largest cluster are discarded, and the remaining clusters are selected for oversampling. Synthetic samples are generated based exclusively on the nearest neighbors within the danger region
To evaluate the developed approach, 10 real-world datasets were used, including the Solar Flare Dataset, Sales Dataset, and Bank Dataset, among others. The F1-score and recall of HCAB-SMOTE were compared to SMOTE using fivefold stratified cross-validation. It was observed that HCAB-SMOTE consistently outperformed SMOTE, with an average F1-score improvement of 36.56%
This method consists of applying SMOTE, followed by the implementation of a novel undersampling approach, designated the Synthetic Majority Undersampling TEchnique (SMUTE), that aims to mitigate the loss of information associated with the undersampling process. SMUTE draws inspiration from SMOTE's interpolation concept but applies it in the undersampling context. For each instance in the majority class, \({x}_{i}\), a selective process is carried out to identify a k-nearest neighbour, \({x}_{h}\). Then, the formulation (11) used in SMOTE is applied to create a new sample, \({x}_{new}\). However, unlike SMOTE, in SMUTE both \({x}_{i}\) and \({x}_{h}\) are eliminated, with only \({x}_{new}\) being retained
Firstly, the authors evaluated SMUTE versus ROS [24] so as to ascertain its capabilities before benchmarking the hybrid model. 50 datasets from the KEEL repository were utilized, as well as 3 classifiers (MLP, SVM and Logistic Regression). It was found that, independently of the classifier used, SMUTE registered a statistically significantly higher F1-score (Wilcoxon signed-rank test with a significance level of 0.05). Secondly, this method was compared with other state-of-the-art models, namely RUS, Near Miss [78], NCL [77], ROS, SMOTE [23], Borderline-SMOTE [28], SMOTE + TL [45] and SMOTE + ENN [58]. CSMOUTE yielded the lowest average rank with respect to the AUC and G-mean when combined with the SVM and MLP. Finally, by applying an analysis based on instance type as proposed by Napierala [9], it was verified that this method excels when there are many outlier and borderline samples
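A minimal sketch of the SMUTE step is provided below; pair selection is simplified to a random majority sample and a random member of its majority neighbourhood, which may differ from the selective process described above.

```python
# Sketch of SMUTE: pick a majority sample and one of its k nearest majority
# neighbours, replace both with a single interpolated point, and repeat until
# the desired majority size is reached (each iteration removes one sample).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smute(X_maj, n_target, k=5, rng=None):
    rng = np.random.default_rng(rng)
    X_maj = [np.asarray(x) for x in X_maj]
    while len(X_maj) > n_target and len(X_maj) > k + 1:
        arr = np.asarray(X_maj)
        i = int(rng.integers(len(arr)))
        nn = NearestNeighbors(n_neighbors=k + 1).fit(arr)
        _, idx = nn.kneighbors(arr[i][None, :])
        h = int(rng.choice(idx[0, 1:]))          # random neighbour, excluding i itself
        if h == i:                               # can happen with duplicate points; resample
            continue
        # SMOTE-style interpolation, but both endpoints are removed afterwards.
        x_new = arr[i] + rng.uniform(0, 1) * (arr[h] - arr[i])
        for j in sorted((i, h), reverse=True):
            X_maj.pop(j)
        X_maj.append(x_new)
    return np.asarray(X_maj)
```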
A hybrid methodology aimed at minimizing the occurrence of class overlap while ensuring a balanced class distribution. To this end, this method foresees 3 distinct stages: (1) detection of overlapping regions; (2) evolutionary undersampling; and (3) random oversampling. Regarding the first stage, the authors define samples as being in overlapping regions if they have at least 1 nearest neighbour of the minority class. By conducting this procedure, a set containing all samples in overlapping regions is created, referred to as \(o\_set\). As far as the evolutionary undersampling is concerned, there are two important aspects to consider. Firstly, only the samples of the \(o\_set\) are coded, as opposed to the whole dataset, in order to minimize the computational complexity of this approach. Complementarily, the authors propose a fitness function that seeks to simultaneously (1) minimize the number of samples in the previously defined overlapping region, \(OR\); (2) minimize the imbalance ratio, \(IR\); and (3) minimize the loss of information, which is evaluated as a function of the G-mean of a 1-NN classifier, \(GM\). Formally, the fitness function is given by the following expression: \(fitness= \alpha \frac{1-OR}{IR}+(1-\alpha )GM\), where \(\alpha \) controls the degree of contribution between the different variables. Finally, random oversampling is performed to generate a balanced dataset. Note that the authors avoid using SMOTE due to its tendency to induce class overlap
100 datasets from the KEEL repository were used to evaluate the performance of EHSO. Additionally, 7 undersampling methods, 5 oversampling methods, and 4 hybrid methods were used to comparatively evaluate the performance of the proposed method. Initially, the authors performed some parameter adjustments to the evolutionary undersampling procedure, where it was found that \(\alpha =0.1\) and 30 iterations generated the most competitive performance. Secondly, the authors performed a modular evaluation of the method, where it was found that the G-mean increased with the addition of the OR and IR terms to the fitness function. Finally, and based on a decision tree classifier, EHSO was found to guarantee the highest average AUC, G-mean and F1-score
The table provides a detailed overview of the taxonomy, methodology, and main conclusions of each study. Note that the mathematical notation used in the description of each method corresponds to that utilized in the original articles, simplifying cross-referencing for further consultation, if necessary
As one can observe, this evaluation enables an adaptive determination of unsafe regions, characterized by class overlap or noise, which then guides the sample selection process, iterative sample generation/elimination, the data cleaning procedure, and so on. Note that the correct identification of problematic samples seems to be one of the most determining factors in the success of resampling methods. One example of this fact was presented in [124], where the authors concluded that oversampling methods tend to achieve superior performance compared to undersampling approaches due to their ability to retain a higher proportion of safe samples while effectively eliminating the most unsafe samples, with safe and unsafe samples characterized based on the framework proposed by Napierala et al. [9].
Given the relevance of the complexity assessment in making resampling approaches more versatile and robust, it is evident that developing metrics, or combinations thereof, that can holistically assess the complexity of a classification problem, encompassing the type and severity of different data difficulty factors, holds utmost importance. In [79], the instance hardness of a dataset, evaluated through Eq. (13), was found to be more correlated with a linear combination of complexity measures evaluating different data difficulty factors than with individual complexity metrics. As such, more comprehensive complexity measures would provide resampling methods with more relevant information to tailor their behaviour, including, for instance, how to perform sample selection for oversampling or which samples to eliminate post-oversampling, according to the specific type of data. Tackling open research issues in complexity assessment, such as the need to combine multiple metrics for a more thorough estimation of classification complexity, and incorporating these methods into resampling approaches will likely enhance their ability to optimally manage a wider array of classification problems.
A final aspect to consider is that, occasionally, scientific works tend to evaluate the proposed methods on datasets with different degrees of imbalance [40]. As discussed in "Class imbalance: formal definition", class imbalance is a very poor indicator of classification complexity and, as such, the conclusions of such analyses may be misleading. To address this, it would be beneficial to utilize some of the complexity measures presented in Table 1, as they can effectively deal with imbalanced domains. By doing so, datasets can be more accurately grouped based on their data characteristics and new insights may be derived regarding the strengths and weaknesses of proposed methods.
Resampling approaches and recommendation systems
The comprehensive literature review conducted on this subject matter also demonstrated that the newly proposed methods rarely achieve optimal results across all datasets and under various experimental setups (different datasets with various imbalance ratios, dimensionality, etc.). This holds true even when compared to non-state-of-the-art techniques such as the original SMOTE. To illustrate this observation, Radius-SMOTE exhibited the highest performance in 5 out of 13 datasets [44], GSMOTE-NFM achieved the highest G-mean in 14, 10, and 13 out of 24 datasets when employing SVM [2], Logistic Regression, and C4.5 decision tree classifiers, respectively, while OS-CCD attained the best AUC value in 10 out of the 12 datasets analysed [41].
These findings underscore the notion that, despite the increasing adaptability of resampling approaches to address classification problems, the existence of a singular approach capable of consistently ensuring optimal performance across all scenarios remains uncertain, as would be theoretically expected under the No Free Lunch Theorem [83]. A prominent example of this observation is the efficacy of majority-class based methodologies, such as SWIM, in scenarios characterized by extreme class imbalance, where the available information pertaining to the minority class is deemed insufficient for accurate estimation of its distribution and characteristics [40]. Conversely, when dealing with datasets exhibiting a lower degree of imbalance, minority-based approaches appear to yield superior performance [23]. This discrepancy suggests that, while a given method may demonstrate versatility in handling diverse domains, specific data characteristics, such as extreme imbalance, may benefit from a complete paradigm shift in the fundamental approach employed by the methodologies. Considering this reasoning, a potential alternative for addressing this issue could involve the application of recommendation systems based on complexity metrics, or other metrics, which provide insights into the resampling methodology with the highest probability of achieving superior performance. Hence, several works within this topic are discussed below.
Recommendation systems for resampling methodologies described in the scientific literature can be broadly classified into two categories. The first category pertains to studies aiming to establish interpretable relationships or rules between the data characteristics and the resampling strategies, typically derived using data mining techniques such as Exceptional Preferences Mining. The remaining methodologies adopt a more conventional meta-learning framework, defined as the process of adapting a learning system to a given type of data by utilizing knowledge derived from previous learning tasks involving similar data or data from other domains [125]. Formally, and as described by Parmezan et al. [125], this concept can be defined as follows: let \(D\) be the problem repository, \(A\) the learning algorithm space, and \(Y\) the performance measure space. Given these assumptions, the goal of a meta-learning system is to map \(f({d}_{i})\) to \(A\), where \(f({d}_{i})\) corresponds to a set of meta-features utilized to describe a classification problem (dataset). By doing so, the system can automatically recommend the algorithm \(a\in A\) which maximizes performance. In practical terms, this type of system must be trained before being integrated into machine learning pipelines. Within this framework, a set of meta-features is typically extracted from each dataset. These meta-features may include statistical properties, such as the number of samples in the dataset, or more complex metrics that provide insights into the presence of data difficulty factors, such as complexity measures. Subsequently, for each dataset, the most suitable resampling method (and potentially its parameterization) is typically determined experimentally, serving as the target variable and enabling meta-model training. Once the meta-classifier has been trained, its incorporation into ML pipelines requires the extraction of the selected meta-features relevant to the specific application. These meta-features are then used to infer the optimal resampling method, which is subsequently applied. A schematic representation of such systems is presented in Fig. 4.
Fig. 4
Schematic representation of a machine learning pipeline with an integrated resampling recommendation system (bottom), and the training procedure for such a system based on existing literature (top). The recommender can either be a set of interpretable rules (first model typology) or a classifier/ranker model that returns the resampling method with the highest probability of ensuring optimal post-resampling classifier performance (second model typology). Initially, the recommender system is trained on a meta-dataset, composed of meta-features and the optimal resampling methodology as the target variable. The utilized meta-features can vary broadly depending on the characteristics of the system; however, statistical features such as the number of features or complexity measures are commonly adopted. In machine learning pipelines, the trained recommender system is included prior to resampling in order to ascertain the optimal resampling approach
Regarding the first aforementioned approach, in [126], authors utilized 40 datasets sourced from KEEL, UCI, Scikit-learn library, and a medical dataset concerning infections contracted by patients in the Intensive Care Unit, to derive association rules that correlate data characteristics with the corresponding optimal resampling approach. The data characteristics (meta-features) employed included the number of features, number of samples, imbalance ratio, class overlap assessed based on the F1 complexity, and the presence of borderline instances evaluated by the N1 metric. To accomplish this objective, the authors initially evaluated the resampling technique (RUS, ROS [24], SMOTE [23], SMOTE + OSS, SMOTE + CNN, SMOTE + ENN [58], and SMOTE + TL [45]) that ensured the highest performance for each dataset, as measured by the Index of Balanced Accuracy (IBA). This measure was obtained through a tenfold cross-validation process. Following the creation of the meta-dataset, the Apriori Algorithm was employed for mining association rules, enabling the derivation of rules such as: if the imbalance ratio is less than 48, N1 ranges from 6.75 to 12.56, and F1 exceeds 78.8, then ROS is the most suitable resampling approach. The analysis of the obtained rules was consistent with prior results reported in the literature, indicating that no resampling produced superior results if the imbalance ratio was extremely low. Similarly, it was observed that SMOTE, when combined with CNN, OSS, and TL, was the most frequently chosen approach according to the derived rules, highlighting the effectiveness of hybrid methods. Moreover, other rules were induced not based on the IBA value, but from the rankings of resampled approaches, as assessed by Friedman and post hoc statistical tests. Lastly, it is worth noting that, despite the study's ability to infer rules about the relationship between data characteristics and suitable resampling approaches, it did not assess the model's generalizability on new data, thereby limiting the evaluation of the applicability of the drawn conclusions.
Alternatively, Luengo et al. evaluated the potential of complexity metrics for establishing the optimal operating ranges of three resampling methods, namely SMOTE [23], SMOTE-ENN [58], and EUSCHC [102], using 44 datasets from the UCI repository [127]. The overarching aim was to establish a system that, given the characterization of a classification problem via complexity measures, could predict the suitability of a specific resampling method. In this regard, a resampling method was considered to exhibit good behaviour if the test AUC, evaluated through fivefold cross-validation, surpassed 0.8 and the difference between the training and test AUC was below 0.1; a method was classified as exhibiting bad behaviour if it failed either condition. In practice, the determination of the optimal intervals entailed plotting the AUC obtained for each dataset, sorted in ascending order of the corresponding complexity value. Such a representation allowed for the identification of regions populated by consecutive datasets with bad or good behaviour, enabling the determination of the competence domains of each method. Through this procedure, the complexity measures F1, N4, and L3 proved capable of extracting the most significant intervals [127].
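The good/bad behaviour criterion of [127] translates directly into a small labelling routine; the sketch below restates it with the thresholds given above, and could then be applied per dataset and per resampler before ordering the datasets by a complexity measure to locate competence intervals. The example records are invented for illustration.

```python
def behaviour_label(auc_train: float, auc_test: float) -> str:
    """Label a (dataset, resampler) pair following the criterion in [127]:
    'good' if the test AUC exceeds 0.8 and the train/test AUC gap is
    below 0.1, otherwise 'bad'."""
    if auc_test > 0.8 and (auc_train - auc_test) < 0.1:
        return "good"
    return "bad"

# Example: order datasets by a complexity value (e.g., F1) and inspect runs
# of consecutive labels to delimit the competence domain of a method.
records = [  # (complexity_value, auc_train, auc_test) per dataset
    (0.15, 0.95, 0.91), (0.32, 0.93, 0.88), (0.61, 0.90, 0.74),
]
labels = [behaviour_label(tr, te) for _, tr, te in sorted(records)]
print(labels)  # e.g. ['good', 'good', 'bad']
```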
Lastly, we highlight the methodology proposed by Santos et al., where Exceptional Preferences Mining is used to derive interpretable rules indicating the suitability of specific methods over others [128]. From a methodological perspective, this approach entailed the construction of a meta-dataset where the meta-features were extracted via the Python library pymfe [129] (namely statistical, complexity, and landmarking features, among others), and the target variables corresponded to the rankings of 9 resampling approaches based on the F1-score. The resampling techniques considered were the following: ROS, SMOTE [23], SafeLevel-SMOTE [31], Borderline-SMOTE [28], ADASYN [30], AHC [103], ADOMS [130], SMOTE-TL [45], and SMOTE-ENN [58]. The evaluation of the results endorsed some claims made in "Class imbalance: formal definition" as well as the previously discussed work of Kraiem et al. [126], as it was observed that if a dataset's complexity is minimal, reflected by lower values of L2, N1, N4, and T3, or by a higher percentage of safe points versus borderline points, the implementation of resampling approaches becomes redundant [128].
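Since pymfe [129] is the meta-feature extractor named above, a minimal extraction call might look as follows; the chosen feature groups are only an example, and the exact groups and summary functions used in [128] may differ.

```python
from sklearn.datasets import make_classification
from pymfe.mfe import MFE

# Toy imbalanced dataset standing in for one entry of the meta-dataset.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Extract statistical, complexity, and landmarking meta-features.
mfe = MFE(groups=["statistical", "complexity", "landmarking"])
mfe.fit(X, y)
names, values = mfe.extract()

# Each dataset contributes one row (values) to the meta-dataset, labelled
# with the ranking of resampling methods obtained for that dataset.
print(dict(zip(names, values)))
```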
Addressing the second typology of methods previously mentioned, a study by Morais et al. advocated the utilization of simple attributes (such as the number of samples, the number of attributes, and the presence of outliers) and statistical features to ascertain the most suitable undersampling method and its optimal parameterization, employing a k-NN based approach [131]. In practice, the recommendation process involves calculating the meta-features for the dataset under analysis, followed by the identification of the nearest neighbour amongst the existing meta-examples; the system's output is the ranking of algorithms associated with that nearest neighbour. The undersampling techniques incorporated into this recommendation system were ENN [77], KMUS, MD, NCL [77], NearMiss-1 [78], NearMiss-3 [78], and RUS. However, varying parameterizations were tested for each resampling method, culminating in a total of 491 candidate resampling algorithms that could be chosen for an imbalanced input dataset. Experimentally, the authors utilized 29 publicly accessible datasets, and the optimal methods for each dataset were identified using a weighted performance metric combining Accuracy, AUC, F1-score, Specificity, and Negative Predictive Value. The results showed that the proposed system consistently outperformed a random search approach, which involves choosing the best method from 50 randomly selected resampling methods and their corresponding parameterizations; furthermore, it matched the performance of a brute-force search in 8 out of the 29 datasets [131]. In [132], the authors adopted a very similar approach based on k-NN rank aggregation to generate recommendations. However, the developed system was able to recommend not only resampling approaches but also algorithm-level approaches, such as Bagging, AdaBoost [133], MetaCost [134], and SMOTEBoost [135], among others. In addition, other meta-features were used, namely the complexity metrics proposed by Ho and Basu [136], landmarking measures, model-based measures, and structural measures. A thorough evaluation of the developed system on 80 imbalanced binary datasets from the UCI, KEEL, and Promise repositories yielded excellent results: in 96% of cases, the recommended approaches were among the top three techniques.
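The nearest-neighbour recommendation step described for [131, 132] can be sketched as follows; the meta-feature matrix, stored rankings, and single-neighbour lookup are illustrative simplifications rather than a reproduction of either system.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# meta_X: meta-feature vectors of previously analysed datasets (in practice
# these would be normalized before distance computation).
# meta_rankings: for each of those datasets, the stored ranking of candidate
# undersampling configurations, best first (placeholder values).
meta_X = np.array([[1000, 20, 0.10], [250, 8, 0.04], [5000, 64, 0.02]])
meta_rankings = [
    ["NCL", "ENN", "RUS"],
    ["RUS", "NearMiss-3", "ENN"],
    ["NearMiss-1", "NCL", "RUS"],
]

# Index the meta-examples once (1-NN lookup, as in a k-NN recommender).
index = NearestNeighbors(n_neighbors=1).fit(meta_X)

def recommend(new_meta_features):
    """Return the ranking stored for the closest meta-example."""
    _, neighbour = index.kneighbors(
        np.asarray(new_meta_features, dtype=float).reshape(1, -1))
    return meta_rankings[neighbour[0][0]]

print(recommend([800, 16, 0.08]))  # e.g. ['NCL', 'ENN', 'RUS']
```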
As we conclude this section, it is important to note that these systems are built upon a limited number of datasets, generally smaller in size and frequently corresponding to binary classification problems, thus restricting their capacity for generalization and their integration into machine learning pipelines across various domains. In addition, in light of the discussion concerning the limitations of the Imbalance Ratio in predicting classification task difficulty and the bias of complexity measures in imbalanced regimes, it is likely that these systems could be enhanced with metrics that more precisely assess the complexity of the classification task, such as those presented in "Classification complexity in imbalanced domains". Despite these limitations, this approach holds considerable potential due to its reduced computational cost and its ability to avoid a brute-force search, which currently seems to be the most common strategy for determining the optimal methodology [131]. Furthermore, these systems obviate the need for users to possess deep technical knowledge about the most fitting resampling techniques, thus facilitating the application of these approaches in other areas.
Lastly, it is worth noting that, outside recommendation systems, there may be other viable approaches to tackle this issue such as the utilization of ensemble resampling methods to mitigate performance variance in different datasets. In [22], the authors proposed the application of cluster-based undersampling using different algorithms and parameterizations to minimize the influence of selected parameters, which have been demonstrated to impact performance. Moreover, the integration of complexity-based recommendation systems could prove valuable in determining the relative importance of base resampling methods based on the specific characteristics of the data.
Conclusion
The pervasiveness of the class imbalance problem across various sectors, coupled with its significant impact on the performance of machine learning algorithms, attests to its critical relevance in contemporary settings, where societal dependence on these types of systems is steadily escalating. Given that data-level approaches are the most commonly employed techniques to tackle this issue, this paper has undertaken a comprehensive literature review of recent resampling approaches. Furthermore, considering the reliance of these strategies on evaluating data difficulty factors to implement a suitable resampling strategy, this review was conducted from a data-oriented perspective. Hence, in the first stage, predicated on the formal description of class imbalance, it was discerned that imbalanced domains amplify existing data irregularities, such as class overlap or the existence of small disjuncts, instigating a considerable augmentation in the complexity of the classification task and potentially leading to the total disregard of the minority class by the classifier. Owing to the significance of the interplay between class imbalance and other data difficulty factors, we investigated multiple methodologies that facilitate their simultaneous quantification, endowing the user with a more extensive grasp of the challenges intrinsic to each imbalanced classification task. Simultaneously, an overview of the open research problems associated with such quantification was provided.
In relation to resampling strategies, it was evident that these are progressing towards becoming increasingly adaptive to domain characteristics, thus being reliant on accurately pinpointing problematic regions, i.e., areas populated by the previously mentioned data difficulty factors. This emphasizes the need for research in this area to be conducted in tandem with theoretical advances in classification complexity assessment. Specifically concerning oversampling and undersampling methods, a dichotomy was noted regarding the adopted methodological approaches. Particularly in oversampling, there were methods that aimed to perform oversampling in regions with higher instance hardness, typically located between class boundaries, given that this is considered the most decisive region for model training. This strategy naturally fosters a more assertive expansion of the minority class. In contrast, with the intention of minimizing the probability of oversampling in overlapped or noisy regions, other methods carry out oversampling predominantly in areas with lower instance hardness. Therefore, this technique envisages a larger consolidation of the minority class in regions where it already exhibits considerable density. For the approaches reported in the context of undersampling, the same duality is present. Some methods propose removing majority samples from densely populated and consequently redundant regions, thereby mitigating the loss of information. On the opposite end of the spectrum, undersampling is performed at the class boundaries to minimize overlap between classes, as this phenomenon is recognized as one of the major factors negatively impacting the performance of learning algorithms. Evidently, there is always a trade-off between the different aforementioned typologies, where a more aggressive expansion of the minority class leads to greater sensitivity at the expense of specificity and vice versa. Therefore, the selection of a resampling methodology should always consider the specific application and user preferences. In the context of Deep Learning, it was verified that oversampling techniques are predominantly utilized due to these models' heavy reliance on large data volumes. Lastly, it became apparent that balancing class distributions does not inherently mitigate all biases within the data, since most resampling techniques overlook the potential amplification of biases related to sensitive attributes such as age or gender. This underscores the imperative to combine resampling approaches with other fairness strategies, fostering equitable and ethical AI systems.
Moving on to the analysis of the experimental results of the resampling approaches, one aspect is particularly prominent: no method guarantees superior performance across all tested datasets, even when compared with non-state-of-the-art resampling methods such as ROS or SMOTE. This observation implies that, although resampling approaches are becoming more adaptive, they continue to fall short of optimally addressing all types of domains. Therefore, this review evaluates possible solutions to this predicament, with a particular focus on the resampling recommendation systems documented in the literature for imbalanced scenarios. Another important aspect to consider is that, although an empirical comparative study of all methods was not conducted, a significant number of the surveyed methods include comparisons with very competitive resampling techniques in their experimental protocols (e.g., Radius-SMOTE, GSMOTE-NFM, RSMOTE, SMOTE-WENN, and the method proposed by Sasada et al. all exhibit superior average performance compared to SMOTE-IPF, the best-performing method among the 85 oversampling techniques tested in [68]), thus substantiating the drawn conclusions regarding the emergence of more adaptable methods and the potential of resampling recommendation systems. Concurrently, it should also be noted that, to the best of our knowledge, this is one of the few reviews addressing resampling approaches to handle class imbalance in such a holistic, data-centric manner, providing a joint review of class imbalance from a theoretical standpoint, its quantification, a survey of resampling approaches and, lastly, an analysis of existing resampling recommendation systems. This not only illustrates the pertinence but also the innovative character of the conducted research.
Finally, after reviewing the literature concerning the different aspects related to resampling approaches to handle class imbalance, we present what we consider to be the principal open research problems that should be addressed in future research:
Development of a methodology capable of holistically assessing the complexity of the classification task in imbalanced regimes. This contribution would be fundamental and have applicability on multiple fronts: (i) it would allow correlating the suitability of various resampling methods with the respective data characteristics, that is, the presence and severity of data difficulty factors; (ii) it would facilitate more appropriate benchmarking of resampling methods, making it easier to identify the types of data that most undermine their performance (note that the analysis of resampling methods reported in the literature is predominantly conducted according to the Imbalance Ratio, which, as previously discussed, provides little information regarding the complexity of the classification task); and (iii) a more comprehensive evaluation of classification complexity could be instrumental in the development of more capable resampling methods and resampling recommendation systems.
Development of more adaptable resampling methods, capable of optimally addressing classification problems with varying levels of classification complexity. Two key factors could play a crucial role in achieving greater methodological adaptability. First, the integration of multiple complexity metrics into the oversampling process to assess the presence, interaction, and severity of different data difficulty factors could offer deeper insights into classification complexity, facilitating the development of resampling strategies tailored to each specific problem. Second, exploring the relationship between data difficulty factors and the most effective resampling techniques is crucial. For example, in the context of oversampling, it is important to identify the optimal method for determining the probability of each sample being oversampled, based on the degree of overlap in the data (evaluated using one or more complexity metrics). This relationship should also be investigated for all the other aspects of the oversampling process (those highlighted in Fig. 2), such as whether selecting an initial subset of samples enhances performance, the best methodology for such a selection, whether dimensionality reduction is advantageous, and so on. Furthermore, these studies should extend beyond overlap to consider, either individually or collectively, other data difficulty factors. A deeper understanding of these relationships is essential for developing more adaptable resampling techniques, a topic that remains largely unexplored in the literature.
The creation of a standardized benchmarking methodology to facilitate the direct comparison of proposed resampling methods. This adjustment to current experimental protocols would be instrumental in ensuring that newly proposed methods are consistently compared with the most competitive state-of-the-art methods, rather than with outdated methods such as SMOTE. Additionally, a modular evaluation of the different phases of resampling methods would be beneficial, allowing the identification of the most relevant mechanisms and promoting the development of new approaches. Lastly, considering the relevance of these algorithms' application in real-world scenarios, it is crucial to adopt evaluation metrics that account for both runtime and performance.
As suggested in [1], a methodology remains to be defined that can adaptively perform resampling in accordance with user-defined goals, that is, methods capable of conducting resampling to an extent where the sensitivity (recall) of the minority class is maximized, even if this incurs a high false positive rate, or, alternatively, of striking a balance between the recognition capabilities for both the majority and minority classes (aiming for a high F1 score). While there exist aggressive techniques for expanding the minority class (such as oversampling near the decision boundary), which result in enhanced sensitivity, and oversampling techniques targeting low instance hardness samples (which may provide a better F1-score), they lack the ability to finely control the characteristics of the obtained classifiers, in contrast to cost-sensitive approaches. In methods that perform resampling by assigning a selection probability to each instance, this objective could potentially be achieved by adaptively adjusting the sample selection probability based on the proximity to the opposite class. For instance, by assigning a higher selection probability to safe samples, improved F1 scores could be realized, while favouring samples with high instance hardness may foster higher sensitivity. Subtle modifications to the probability distribution within this range may enable finer regulation of the classifier's behaviour.
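A hedged illustration of this idea is given below: the instance hardness of each minority sample is approximated by the fraction of opposite-class neighbours among its k nearest neighbours, and a single user-defined exponent (here called gamma, a hypothetical parameter) tilts the selection probabilities towards safe samples (negative values) or hard, boundary samples (positive values). The synthesis mechanism itself (e.g., SMOTE-style interpolation from the selected seeds) is left out, and the scheme is a sketch of this research direction rather than an established method.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def seed_selection_probabilities(X, y, minority_label, k=5, gamma=0.0):
    """Compute oversampling seed-selection probabilities for minority samples.

    Hardness of a minority sample = fraction of its k nearest neighbours
    belonging to the opposite class. gamma < 0 favours safe samples,
    gamma > 0 favours hard (boundary) samples, gamma = 0 is uniform.
    """
    minority_idx = np.where(y == minority_label)[0]
    knn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, neigh = knn.kneighbors(X[minority_idx])
    neigh = neigh[:, 1:]                       # drop each sample itself
    hardness = (y[neigh] != minority_label).mean(axis=1)

    # Map hardness to unnormalized weights; epsilon avoids zero weights.
    eps = 1e-3
    weights = (hardness + eps) ** gamma
    return minority_idx, weights / weights.sum()

# Usage sketch: draw seed indices from which synthetic samples would be built.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (rng.random(200) < 0.15).astype(int)       # ~15% minority class
idx, p = seed_selection_probabilities(X, y, minority_label=1, k=5, gamma=1.5)
seeds = rng.choice(idx, size=50, p=p)          # boundary-leaning seed choice
```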
Acknowledgements
Not applicable.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.