ABSTRACT
The accuracy of many classification algorithms is known to suffer when the data are imbalanced (i.e., when the distribution of the examples across the classes is severely skewed). Many applications of binary text classification are of this type, with the positive examples of the class of interest far outnumbered by the negative examples. Oversampling (i.e., generating synthetic training examples of the minority class) is a frequently used strategy for countering this problem. We present a new oversampling method specifically designed for classifying data (such as text) for which the distributional hypothesis holds, according to which the meaning of a feature is somehow determined by its distribution in large corpora of data. Our Distributional Random Oversampling method generates new random minority-class synthetic documents by exploiting the distributional properties of the terms in the collection. We discuss results we have obtained on the Reuters-21578, OHSUMED-S, and RCV1-v2 datasets.
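The general idea of distribution-based oversampling can be illustrated with a minimal sketch. The code below is a hypothetical simplification, not the paper's exact algorithm: it generates synthetic minority-class bag-of-words documents by sampling term counts from a multinomial whose parameters are estimated from the term distribution observed in the minority-class training documents. The function name and interface are illustrative assumptions.

```python
import numpy as np

def distributional_oversample(X_min, n_new, rng=None):
    """Generate synthetic minority-class bag-of-words vectors.

    Illustrative sketch only (not the paper's DRO method): each
    synthetic document is drawn from a multinomial whose term
    probabilities are estimated from the minority-class corpus.
    """
    rng = np.random.default_rng(rng)
    term_counts = X_min.sum(axis=0).astype(float)  # per-term frequency in minority class
    probs = term_counts / term_counts.sum()        # estimated term distribution
    doc_lens = X_min.sum(axis=1)                   # empirical document lengths
    synthetic = []
    for _ in range(n_new):
        length = int(rng.choice(doc_lens))         # sample a realistic document length
        synthetic.append(rng.multinomial(length, probs))
    return np.vstack(synthetic)

# Toy minority-class document-term matrix: 3 documents, 5 terms
X_min = np.array([[2, 0, 1, 0, 0],
                  [1, 1, 0, 0, 1],
                  [0, 2, 1, 1, 0]])
X_new = distributional_oversample(X_min, n_new=4, rng=0)
```

The synthetic rows can then be appended to the minority-class portion of the training set before learning, which is the standard way oversampled examples are used.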