research-article

Entropy based feature selection for text categorization

Authors:
Christine Largeron (Université de Lyon, Saint-Étienne, France)
Christophe Moulin (Université de Lyon, Saint-Étienne, France)
Mathias Géry (Université de Lyon, Saint-Étienne, France)

SAC '11: Proceedings of the 2011 ACM Symposium on Applied Computing, March 2011, Pages 924–928. https://doi.org/10.1145/1982185.1982389

Published: 21 March 2011

ABSTRACT

In text categorization, feature selection can be essential not only for reducing the index size but also for improving classifier performance. In this article, we propose a feature selection criterion called Entropy based Category Coverage Difference (ECCD). This criterion is based on the distribution, across the categories, of the documents containing the term, and it also takes the term's entropy into account. ECCD compares favorably with the usual feature selection methods based on document frequency (DF), information gain (IG), mutual information (MI), χ², odds ratio and GSS on a large collection of XML documents from the Wikipedia encyclopedia. Moreover, this comparative study confirms the effectiveness of feature selection techniques derived from the χ² statistic.
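The abstract does not state the ECCD formula, so the following is only an illustrative sketch of how a criterion of this kind could be computed. It assumes, hypothetically, that the score is the difference between the share of documents containing the term inside and outside a category, weighted by how far the term's entropy over the categories is from its maximum. The function name `eccd_sketch`, the use of document counts for the entropy, and the toy data are assumptions made for this example, not the authors' definition.

```python
import math
from collections import Counter


def eccd_sketch(docs, term, category):
    """Illustrative entropy-weighted coverage difference (assumed form).

    docs is a list of (set_of_terms, category_label) pairs.
    """
    categories = {label for _, label in docs}
    # Number of documents per category, and number containing the term.
    n_per_cat = Counter(label for _, label in docs)
    df_per_cat = Counter(label for terms, label in docs if term in terms)

    total_df = sum(df_per_cat.values())
    if total_df == 0 or len(categories) < 2:
        return 0.0  # term never occurs, or entropy is not meaningful

    n_in = n_per_cat[category]
    n_out = len(docs) - n_in
    df_in = df_per_cat[category]
    df_out = total_df - df_in

    # Coverage difference: P(term | category) - P(term | other categories).
    p_in = df_in / n_in if n_in else 0.0
    p_out = df_out / n_out if n_out else 0.0
    coverage_diff = p_in - p_out

    # Shannon entropy of the term's distribution over the categories
    # (assumption: estimated here from document counts).
    entropy = -sum(
        (df / total_df) * math.log2(df / total_df)
        for df in df_per_cat.values()
        if df > 0
    )
    max_entropy = math.log2(len(categories))

    # Low entropy (term concentrated in few categories) keeps the score high.
    return coverage_diff * (max_entropy - entropy) / max_entropy


# Toy collection, invented for illustration only.
docs = [
    ({"goal", "match"}, "sport"),
    ({"goal"}, "sport"),
    ({"vote"}, "politics"),
    ({"vote", "match"}, "politics"),
]

for t in ("goal", "match", "vote"):
    print(t, eccd_sketch(docs, t, "sport"))
```

In this toy collection, a term that occurs only in the target category scores highest, a term spread evenly over both categories scores zero, and a term confined to the other category scores negatively. Keeping the top-scoring terms per category would then give the reduced index that the abstract refers to.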

Published in
SAC '11: Proceedings of the 2011 ACM Symposium on Applied Computing
March 2011
1868 pages
ISBN: 9781450301138
DOI: 10.1145/1982185
Conference Chairs: William Chu (Tunghai University, TaiChung, Taiwan), W. Eric Wong (University of Texas at Dallas, Richardson, Texas)
Program Chairs: Mathew J. Palakal (Indiana University Purdue University, Indianapolis), Chih-Cheng Hung (Southern Polytechnic State University, Marietta)
Copyright © 2011 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%