DOI: 10.1145/2911451.2914722

Distributional Random Oversampling for Imbalanced Text Classification

Published: 07 July 2016

ABSTRACT

The accuracy of many classification algorithms is known to suffer when the data are imbalanced (i.e., when the distribution of the examples across the classes is severely skewed). Many applications of binary text classification are of this type, with the positive examples of the class of interest far outnumbered by the negative examples. Oversampling (i.e., generating synthetic training examples of the minority class) is an often-used strategy to counter this problem. We present a new oversampling method specifically designed for classifying data (such as text) for which the distributional hypothesis holds, according to which the meaning of a feature is somehow determined by its distribution in large corpora of data. Our Distributional Random Oversampling method generates new random minority-class synthetic documents by exploiting the distributional properties of the terms in the collection. We discuss results we have obtained on the Reuters-21578, OHSUMED-S, and RCV1-v2 datasets.
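
The intuition can be illustrated with a short sketch. The code below is a simplified, hypothetical illustration, not the authors' exact algorithm: it builds each term's distributional profile from a term-term co-occurrence matrix and then creates a synthetic minority-class document by re-emitting every term occurrence of a randomly chosen minority document as a draw from that term's profile, so the synthetic documents vary randomly while remaining distributionally similar to the originals. The function names (cooccurrence_profiles, dro_oversample) and the dense bag-of-words count matrix are assumptions made for the example.

```python
import numpy as np

def cooccurrence_profiles(X):
    """Row-stochastic term-by-term matrix: row t is the distribution of
    terms co-occurring with term t across the collection.
    X is a (documents x vocabulary) count matrix."""
    C = X.T @ X                               # term-term co-occurrence counts
    totals = C.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0                 # guard terms that never occur
    return C / totals

def dro_oversample(X, y, minority_label, n_new, seed=0):
    """Hypothetical sketch of distribution-based oversampling: generate
    n_new synthetic minority-class documents by resampling every term
    occurrence of a randomly chosen minority document from that term's
    distributional profile."""
    rng = np.random.default_rng(seed)
    P = cooccurrence_profiles(X.astype(float))
    minority = np.flatnonzero(y == minority_label)
    vocab = X.shape[1]
    synthetic = np.zeros((n_new, vocab))
    for i in range(n_new):
        doc = X[rng.choice(minority)]         # pick a seed minority document
        for term in np.flatnonzero(doc):
            for _ in range(int(doc[term])):   # one draw per term occurrence
                synthetic[i, rng.choice(vocab, p=P[term])] += 1
    return synthetic

# Toy usage: a tiny imbalanced collection with a 6-term vocabulary.
X = np.array([[2, 1, 0, 0, 0, 0],             # the single minority document
              [0, 0, 3, 1, 0, 0],
              [0, 0, 1, 2, 1, 0],
              [0, 0, 0, 1, 2, 2]])
y = np.array([1, 0, 0, 0])
print(dro_oversample(X, y, minority_label=1, n_new=3))
```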

Published in

SIGIR '16: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2016, 1296 pages
ISBN: 9781450340694
DOI: 10.1145/2911451

                      Copyright © 2016 ACM

                      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                      Publisher

                      Association for Computing Machinery

                      New York, NY, United States

                      Publication History

                      • Published: 7 July 2016


                      Qualifiers

                      • short-paper

                      Acceptance Rates

SIGIR '16 paper acceptance rate: 62 of 341 submissions, 18%. Overall acceptance rate: 792 of 3,983 submissions, 20%.
