1 Introduction
2 Learning from imbalanced data: preliminaries
2.1 Tackling imbalanced data
-
Data-level methods that modify the collection of examples to balance distributions and/or remove difficult samples.
-
Algorithm-level methods that directly modify existing learning algorithms to alleviate the bias towards majority objects and adapt them to mining data with skewed distributions.
-
Hybrid methods that combine the advantages of two previous groups.
2.2 Real-life imbalanced problems
Application area | Problem description |
---|---|
Activity recognition [19] | Detection of rare or less-frequent activities (multi-class problem) |
Behavior analysis [3] | Recognition of dangerous behavior (binary problem) |
Cancer malignancy grading [30] | Analyzing the cancer severity (binary and multi-class problem) |
Hyperspectral data analysis [50] | Classification of varying areas in multi-dimensional images (multi-class problem) |
Industrial systems monitoring [44] | Fault detection in industrial machinery (binary problem) |
Sentiment analysis [65] | Emotion and temper recognition in text (binary and multi-class problem) |
Software defect prediction [48] | Recognition of errors in code blocks (binary problem) |
Target detection [45] | Classification of specified targets appearing with varied frequency (multi-class problem) |
Text mining [39] | Detecting relations in literature (binary problem) |
Video mining [20] | Recognizing objects and actions in video sequences (binary and multi-class problem) |
3 Binary imbalanced classification
3.1 Analyzing the structure of classes
-
It is important to propose new classifiers that can either directly or indirectly incorporate this background knowledge about objects into their training procedure. Thus, when designing an efficient classifier one must not only alleviate the bias towards the majority class, but also pay attention to difficulties of individual minority examples. Preliminary works present in the literature show the high potential of this solution [4, 32].
-
The same idea can be carried over to data preprocessing approaches. Here established structure of the minority class would allow for selecting important or difficult samples to concentrate on. This way we may vary the level of oversampling according to example types (first idea can be tracked to ADASYN method that ignored safe examples [23]) or supervise the undersampling procedure in order not to discard the important minority representatives.
-
Separate studies should be done on the role of noisy / outlier samples in minority class. Recent work by Sáez et al. [47] suggests to remove such examples. However, when dealing with small sample sizes further reductions may be dangerous. Additionally, how one can be sure that given object is an actual noise or outlier and not an inappropriately sampled minority representative? Thus removing such example may lead to wrong classification of potential new objects to appear in its neighborhood.
-
The current labeling method for identifying types of minority objects relies on k-nearest neighbors (with k = 5) or kernel methods. This, however, strongly implies uniform distribution of data. It seems promising to propose adaptive methods that will adjust the size of analyzed neighborhood according to local densities or chunk sizes.
3.2 Extreme class imbalance
-
In cases with such a high imbalance the minority class is often poorly represented and lacks a clear structure. Therefore, straightforward application of preprocessing methods that rely on relations between minority objects (like SMOTE) can actually deteriorate the classification performance. Using randomized methods may also be inadvisable due to a high potential variance induced by the imbalance ratio. Methods that will be able to empower the minority class and predict or reconstruct a potential class structure seem to be a promising direction.
-
Another possible research track to be envisioned relies on decomposition of the original problem into a set of subproblems, each characterized by a reduced imbalance ratio. Then canonical methods could be used. This approach, however, requires a two-way development: (i) algorithms for meaningful problem division and (ii) algorithms for reconstruction of the original extremely imbalanced task.
-
Third challenge lies in efficient extraction of features for such problems. Internet transactions or protein data are characterized by potentially high-dimensional and sparse feature spaces. Furthermore one can predict emergence of such problems in social networks and computer vision. This calls for development of new approaches for representing such data that will allow at the same time for an efficient processing and boosting discrimination of the minority class.
3.3 Classifier’s output adjustment
-
Currently the output is being adjusted for each class separately, using the same value of compensation parameter for each classified object. However, from our previous discussion we may see that the minority class usually is not uniform and difficulty level may vary among objects within. Therefore, it is interesting to develop new methods that will be able to take into consideration the characteristics of classified example and adjust classifier’s output individually for each new object.
-
A drawback of methods based on output adjustment lies in possibility of overdriving the classifier towards the minority class, thus increasing the error on the majority one. As we may expect that the disproportion between classes will hold also for new objects to be classified, we may assume that output compensation will not always be required (as objects will predominantly originate from majority class distribution). Techniques that will select uncertain samples and adjust outputs only for such objects are of interest to this field. Additionally, dynamic classifier selection between canonical and adjusted classifiers seems as a potential useful framework.
-
Output adjustment is considered as an independent approach. Yet from a general point of view it may be fruitful to modify outputs even when data-level or algorithm-level solutions have been previously utilized. This way we may achieve class balancing on different levels, creating more refined classifiers. Analyzing the output compensation may also bring new insight into supervising undersampling or oversampling to find balanced performance on both classes.
3.4 Ensemble learning
-
There is a lack of good understanding of diversity in imbalanced learning. What actually contributes to this notion? Is diversity on majority class as important as on minority class? Undersampling-based ensembles usually maintain the minority class intact or introduce some small variations to it. Therefore, their diversity should not be very high which intuitively is a significant drawback. How then we can explain their excellent classification performance? A detailed study of this is required to better understand this phenomenon.
-
There are no clear indicators on how large ensembles should be constructed. Usually their size is selected arbitrarily, which may result in some similar classifiers being stored in the pool. It would be beneficial to analyze relations betweens characteristics of imbalanced dataset and the number of classifiers required to efficiently handle it, while maintaining their individual quality and mutual complementarity. Additionally, ensemble pruning techniques dedicated specifically to imbalanced problems should be developed [18].
-
Finally, most of imbalanced ensemble techniques use majority voting combination method. In standard scenarios this is a simple and effective solution. However, we may ask ourselves if this is really suitable for imbalanced learning, especially in case of randomized methods. It seems reasonable to assume that base classifiers trained using sampling methods will differ in their individual qualities due to being based on examples with varying difficulties. Therefore, this will propagate into their individual competencies that should be exploited during the combination phase.
4 Multi-class imbalanced classification
4.1 Data preprocessing
-
It is interesting to analyze the type of examples present in each class and their relations to other classes. Here it is not straightforward to measure the difficulty of each sample, as it may change with respect to different classes. For example, a given object may be of borderline type for some groups and at the same time a safe example when considering remaining classes. Therefore, a new and more flexible taxonomy should be presented. Our initial works on this topic indicate that analysis of example difficulty may significantly boost the multi-class performance [46].
-
New data cleaning methods must be developed to handle presence of overlapping and noisy samples that may additionally contribute to deteriorating classifier’s performance. One may think of projections to new spaces in which overlapping will be alleviated or simple removal of examples. However, measures to evaluate if a given overlapping example can be discarded without harming one of classes are needed. In case of label noise it is very interesting to analyze its influence on actual imbalance between classes. Wrongly labeled samples may increase the imbalance (when actually it’s ratio is lower) or mask actual disproportions (wrongly re-balancing the classes). Such scenarios require dedicated methods for detecting and filtering noise, as well as strategies for handling and relabeling such examples.
-
New sampling strategies are required for multi-class problems. Simple re-balancing towards the biggest or smallest class is not a proper approach [1, 15]. We need to develop dedicated methods that will adjust the sampling procedures to both individual properties of classes and to their mutual relations. Hybrid approaches, utilizing more than one method seem as an attractive solution.
4.2 Multi-class decomposition
-
Decomposition methods used so far [14] apply identical approach for each pair of classes. This seems as a slightly naive solution as the pairwise relationships may highly vary. Adjusting used solution to individual pairwise problems seems a highly more flexible solution. Calculating the cost penalties or oversampling ratios individually for each pairs is a good starting direction, but the true challenge lies in proposing a framework that will be able to select specific data or algorithm-level solution on the basis of subproblem characteristics.
-
So far only binary decomposition in form of one-vs-one and one-vs-all was considered. However, there is a number of different techniques that may achieve the same goal, while alleviating the drawbacks of binary solutions (like high number of base classifiers or introducing additional artificial imbalance). Hierarchical methods seem as a promising direction. Here we need solutions to aggregate classes according to their similarities or dissimilarities, preprocess them at each level and then use a sequential, step-wise approach to determine the final class. Alternatively, decomposition with one-class classifiers can be considered, as they are robust to class imbalance and can serve as an efficient tool for dealing with difficult multi-class datasets [33].
-
Finally, using decomposition methods require dedicated combination strategies that are able to reconstruct the original multi-class problem. As ones canonically used in this field were designed for roughly balanced scenarios, it seems worthwhile to design new fusion approaches suitable for cases with skewed distributions. This way it may be possible to compensate for the imbalance both on decomposed class level and on final output combination level.
4.3 Multi-class classifiers
-
A deeper insight is required into how multiple skewed distributions affect the forming of decision boundaries in classifiers. As Hellinger distance was proven to be useful in class imbalance cases, it should be incorporated to other distance-based classifiers. Other solutions with potential robustness to imbalance, like density-based methods, must be explored. When combined with approaches for reducing class overlapping and label noise (two phenomenons present in multi-class imbalanced data that may severely degrade the density estimation process) they should provide a powerful tool for classification.
-
Ensemble solutions should be investigated. They may offer ways to rebalance the distributions in varied ways. Exploring local competencies of classifiers and creating sectional decision areas may also alleviate significantly the difficulty of the problem. Here once again rises the issue of maintaining the diversity in ensemble systems and proper selection of useful base learners.
5 Multi-label and multi-instance imbalanced classification
-
In multi-label learning there is a need for skew-insensitive classifiers that do not require resampling strategies. It seems promising to use existing multi-label methods (such as hierarchical multi-label classification tree or classifier chains) and combine them with skew-insensitive solutions available in multi-class classification domain. An ideal goal would be development of such multi-label classifiers that display similar performance to canonical methods on balanced multi-label problems, while being at the same time robust to presence of imbalance.
-
Another interesting direction is to investigate the possibilities of using decomposition-based solutions. Binary relevance is a popular method, transforming a multi-label problem into a set of two-label subproblems. Hence, balancing label distributions seems more straightforward when applied to each subproblem individually. This problem could also be approached as an aggregation, to create balanced super-classes and then solve the problem in an iterative divide-and-conquer manner.
-
In multi-instance learning when resampling bags we must take into consideration that they may be characterized by a different level of uncertainty. Bags that have a higher probability of consisting purely of objects from the same class should be prioritized, as they carry a better representation of the target concept. This requires a development of new measures for assessing the quality of training bags and selecting the most useful ones.
-
Current sampling approaches for multi-instance learning work on the level of either bags or objects within bags. However, difficulties may arise from both of these situations occurring at the same time. Firstly, it is important to establish how each of those types of sample imbalance affects a classifier’s performance and is any of them more harmful to it. Secondly, global schemes for tackling these two types of imbalance at once must be proposed. They should identify, which type of imbalance is predominant and adapt their solution based on the characteristics of analyzed problem.
6 Regression in imbalanced scenarios
-
Development of cost-sensitive regression solutions that are able to adapt the penalty to the degree of importance assigned to rare observations. Not only existing methods should be modified to include object-related cost information, but also methodologies for assigning such costs must be developed. It seems interesting to investigate the possibility of adapting the cost not only to the minority group, but to each individual observation. This would allow for a more flexible prediction of rare events of differing importance.
-
Methods that will allow to distinguish between minority and noisy samples must be proposed. In case of small number of minority observations the potential presence of noisy ones may significantly affect both preprocessing and regression itself. Therefore, a deeper insight into what makes a certain observation a noisy outlier and what a valuable minority example is required.
-
As in classification problems, ensemble learning may offer significant improvement in both robustness to skewed distributions and in predictive power. This direction is especially interesting, as the notion of ensemble diversity is better explored from theoretical point of view in regression than in classification [6]. This should allow to developing efficient ensemble regression systems that will have directly controlled diversity level on minority observations.
7 Semi-supervised and unsupervised learning from imbalanced data
-
There exist plethora of cluster validity indexes designed for evaluating and selecting proper clustering models. However, none of them takes into account the fact that actual clusters may be of highly varying sizes. Therefore, existing methods should be modified and new indexes should be proposed that measure how well discovered groups reflect the actual skewed distributions in analyzed dataset.
-
Popular centroid-based clustering approaches should be adjusted to imbalanced scenarios, allowing more flexible cluster adjustment. A high potential lies in hybridization with density-based methods for analyzing the local neighborhood of centroid, or in local differing of similarity measures that are used to assign given object to certain cluster.
-
It is very interesting to apply clustering only on the minority class. This way we would be able to discover substructures within that can be utilized in further learning steps. Extracting additional knowledge about the examples within the minority class would also allow for a more detailed insight into difficulties embedded in the nature of data. However, for this we need refined clustering solutions that would be able to analyze the structure of this class, while paying attention to different possible types of examples that may appear. Here an algorithm cannot be to generative (not to lose small groups of objects) or to prone to overfitting (which can be an issue in case of small-sample sized minority classes).
-
Clustering can also be used as a method for fitting more specialized classifiers. Hence, we would be interested in methods that are able to divide the original problem into a set of atomic subproblems and identify difficult areas in the decision space. Such clusters would allow to decompose the classification problem and analyze it independently for each part of the decision space, thus being able to locally adapt the solution to the level and type of difficulty present.
-
Important issue is how to detect the underlying class imbalance in semi-supervised and active learning. Here the initial batch of examples may be fairly balanced, but original distributions can be significantly skewed. On the other hand, the starting set of objects may display imbalance that do not reflect the original one (majority class may be in fact the minority one or there may be in fact no imbalance at all). To answer such questions novel unsupervised methods for assessing the distributions and potential difficulty of unlabeled objects must be introduced. Additionally, novel active learning strategies that can point out to the most difficult objects that will have highest effect on learned decision boundaries are of interest to machine learning community.
8 Learning from imbalanced data streams
-
Most of existing works assume that we deal with a binary data stream for which the relationships between classes may change over time: one of the classes may obtain increased number of samples in given time window, other class may have reduced number of new instances, or classes may reach equilibrium. However, the imbalanced problem may occur in data streams from various reasons. A very important aspect that requires a proper attention is connected with the phenomenon of new class emergence and/or disappearance of the old ones. When a new class appears it will be naturally underrepresented with respect to the ones present so far. Even with increased number of examples from it the imbalance still will be present, as classifiers have processed a significant number of objects before this new class emerged. It is interesting how to handle this bias originating from classifier’s history. Another issue is how to handle fading classes. When objects from given distribution will become less and less frequent should we increased their importance along with increasing imbalance ratio? Or on the contrary, we should reduce it as this class becomes less representative to the current state of the stream?
-
Another vital challenge is connected with the availability of class labels. Many contemporary works assume that class label is available immediately after the new sample is being classified. This is far from real-life scenarios and would impose tremendous labeling costs on the system. Therefore, method for streaming data that reduce the cost of supervision (e.g., by active learning) is currently of crucial importance in this field [69]. This raises an open issue on how to sample imbalanced data streams? Can the active learning approach be beneficial to reducing the bias towards the majority class by intelligent selection of samples? On the other hand, one must develop labeling strategies that will not overlook minority representatives and adjust their labeling ratio to how well the current state of minority class is captured.
-
In many real-life streaming applications (like computer vision or social networks) the imbalance may be caused by reappearing source. This leads to recurring drift that affects proportions between class distributions. Thus, it seems worthwhile to develop methods dedicated to storing general solutions to such scenarios instead of reacting to each reappearance of class imbalance anew. This requires development of algorithms to extract drift’s templates and use them to train specific classifiers that will be stored in repository. When a similar change in class distributions will appear we may use available classifier instead of training a new one from the scratch. To obtain best flexibility and account for the fact that even in recurring drifts the properties of individual imbalanced cases may differ, one needs fast adaptation methods that will use the stored classifiers as a starting point.
-
Using characteristics and structure of minority class is a promising direction for static imbalanced learning. But is it possible to adapt this approach to streaming data? The analysis of minority example types poses a high computational cost, but one may use hardware speed-up to alleviate this problem. Additionally, even if samples arrive online their influence on minority class is local which may be an useful hint for designing such systems. Algorithms dedicated to tracking and analyzing the history of changes in minority class structure may also provide us with valuable insight into the evolution of imbalance problem in given stream over time.
9 Imbalanced big data
-
SMOTE-based oversampling methods applied in distributed environments such as MapReduce tend to fail [13]. This can be caused by a random partitioning of data for each mapper and thus introducing artificial samples on the basis of real objects that have no spatial relationships. Therefore, to apply SMOTE-based techniques for massive datasets one either require new global-scale and efficient implementations, data partitioning methods that preserve relations between examples, or some global arbitration unit that will supervise the oversampling process.
-
When mining big data we are interested in value we can extract from this process. A promising direction lies in developing methods that by analyzing the nature of imbalance will allow us to gain a deeper understanding of given problem. What is the source of imbalance, which types of objects are the most difficult ones, where overlapping and noise occurs and how does this translate to its business potential? Additionally, interpretable classifiers that can handle massive and skewed data are of interest.
-
We must develop methods for processing and classifying big data in form of graphs, xml structures, video sequences, hyperspectral images, associations, tensors etc [7, 37]. Such data types are becoming more and more frequent in both imbalanced and big data analytics and impose certain restrictions on machine learning systems. Instead of trying to convert them to numerical values it seems valuable to design both preprocessing and learning algorithms that will allow a direct handling of massive and skewed data represented as such complex structures.
-
When dealing with imbalanced big data we face one of two possible scenarios: when majority class is massive and minority class is of a small sample size and when imbalance is present but representatives from both classes are abundant. First issue is related directly to the problem of extreme imbalance discussed in Sect. 3.2 and solutions proposed there should be adjusted to large-scale analytics. Second issue is related to the observations that imbalance ratio may not be the main source of learning difficulties. This requires an in-depth analysis of the structure of minority class and examples present there. But it also raises a question: is the taxonomy discussed in Sect. 3.1 still valid when facing massive datasets? Big data imbalance may cause the appearance of new types of examples or changes in properties of already described types. Additionally, we deal with a much more complex scenarios that would require local analysis of each difficult region and fitting solutions individually to each of them.
10 Conclusions
-
Focus on the structure and nature of examples in minority classes in order to gain a better insight into the source of learning difficulties.
-
Develop methods for multi-class imbalanced learning that will take into account varying relationships between classes.
-
Propose new solutions for multi-instance and multi-label learning that are based on specific structured nature of these problems.
-
Introduce efficient clustering methods for unevenly distributed object groups and measures to properly evaluate and select partitioning models in such scenarios.
-
Consider imbalanced regression problems and develop methods for deeper analysis of individual properties of rare examples.
-
Analyze the nature of class imbalance in data streams beyond simple notion of two classes with shifting distributions.
-
Gain a deeper insight into the potential value that can be extracted from interpretable analysis of imbalanced big data.