ABSTRACT
The accuracy of many classification algorithms is known to suffer when the data are imbalanced (i.e., when the distribution of the examples across the classes is severely skewed). Many applications of binary text classification are of this type, with the positive examples of the class of interest far outnumbered by the negative examples. Oversampling (i.e., generating synthetic training examples of the minority class) is a frequently used strategy for countering this problem. We present a new oversampling method specifically designed for classifying data (such as text) for which the distributional hypothesis holds, according to which the meaning of a feature is somehow determined by its distribution in large corpora of data. Our Distributional Random Oversampling method generates new random minority-class synthetic documents by exploiting the distributional properties of the terms in the collection. We discuss results we have obtained on the Reuters-21578, OHSUMED-S, and RCV1-v2 datasets.
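The general idea of distribution-based oversampling can be illustrated with a minimal sketch. The code below is a hypothetical simplification, not the paper's exact algorithm: it generates synthetic minority-class bag-of-words documents by sampling term counts from a multinomial whose parameters are estimated from the term distribution observed in the minority-class training documents. The function name and interface are illustrative assumptions.

```python
import numpy as np

def distributional_oversample(X_min, n_new, rng=None):
    """Generate synthetic minority-class bag-of-words vectors.

    Illustrative sketch only (not the paper's DRO method): each
    synthetic document is drawn from a multinomial whose term
    probabilities are estimated from the minority-class corpus.
    """
    rng = np.random.default_rng(rng)
    term_counts = X_min.sum(axis=0).astype(float)  # per-term frequency in minority class
    probs = term_counts / term_counts.sum()        # estimated term distribution
    doc_lens = X_min.sum(axis=1)                   # empirical document lengths
    synthetic = []
    for _ in range(n_new):
        length = int(rng.choice(doc_lens))         # sample a realistic document length
        synthetic.append(rng.multinomial(length, probs))
    return np.vstack(synthetic)

# Toy minority-class document-term matrix: 3 documents, 5 terms
X_min = np.array([[2, 0, 1, 0, 0],
                  [1, 1, 0, 0, 1],
                  [0, 2, 1, 1, 0]])
X_new = distributional_oversample(X_min, n_new=4, rng=0)
```

The synthetic rows can then be appended to the minority-class portion of the training set before learning, which is the standard way oversampled examples are used.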