DOI: 10.1145/1982185.1982389
research-article

Entropy based feature selection for text categorization

Published: 21 March 2011

ABSTRACT

In text categorization, feature selection can be essential not only for reducing the index size but also for improving classifier performance. In this article, we propose a feature selection criterion called Entropy based Category Coverage Difference (ECCD). The criterion is based on the distribution across categories of the documents containing a term, and it also takes into account the entropy of that term. ECCD compares favorably with the usual feature selection methods based on document frequency (DF), information gain (IG), mutual information (MI), χ², odds ratio, and GSS on a large collection of XML documents from the Wikipedia encyclopedia. Moreover, this comparative study confirms the effectiveness of feature selection techniques derived from the χ² statistic.
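The abstract does not state any formulas, so the following is only a minimal sketch of how such criteria can be scored from document counts. The χ² and GSS functions use their standard textbook definitions. The function `entropy_weighted_ccd` is an assumption: it is one plausible reading of "category coverage difference weighted by term entropy" and should not be taken as the paper's definition of ECCD. All function names and the toy corpus are hypothetical.

```python
import math

def contingency(docs, term, category):
    """Return (a, b, c, d) document counts for a term/category pair.

    a: has term, in category      b: has term, other category
    c: no term, in category       d: no term, other category
    docs is an iterable of (set_of_terms, category_label) pairs.
    """
    a = b = c = d = 0
    for terms, label in docs:
        has_term, in_cat = term in terms, label == category
        if has_term and in_cat:
            a += 1
        elif has_term:
            b += 1
        elif in_cat:
            c += 1
        else:
            d += 1
    return a, b, c, d

def chi_square(a, b, c, d):
    """Standard chi-square statistic of a 2x2 contingency table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def gss(a, b, c, d):
    """GSS coefficient: P(t,c)P(not t,not c) - P(t,not c)P(not t,c)."""
    n = a + b + c + d
    return (a * d - b * c) / (n * n) if n else 0.0

def term_entropy(docs, term):
    """Shannon entropy (bits) of the term's document distribution over categories."""
    counts = {}
    for terms, label in docs:
        if term in terms:
            counts[label] = counts.get(label, 0) + 1
    total = sum(counts.values())
    return sum(-(k / total) * math.log2(k / total) for k in counts.values())

def entropy_weighted_ccd(docs, term, category, categories):
    """ASSUMED form: (P(t|c) - P(t|not c)) * (H_max - H(t)) / H_max."""
    a, b, c, d = contingency(docs, term, category)
    p_in = a / (a + c) if a + c else 0.0
    p_out = b / (b + d) if b + d else 0.0
    h_max = math.log2(len(categories)) if len(categories) > 1 else 1.0
    return (p_in - p_out) * (h_max - term_entropy(docs, term)) / h_max

# Toy corpus (hypothetical data, for illustration only).
docs = [
    ({"goal", "match"}, "sport"),
    ({"goal", "team"}, "sport"),
    ({"vote", "goal"}, "politics"),
    ({"vote", "law"}, "politics"),
]
categories = ["sport", "politics"]
for term, cat in [("goal", "sport"), ("vote", "politics")]:
    table = contingency(docs, term, cat)
    print(term, cat, chi_square(*table), gss(*table),
          entropy_weighted_ccd(docs, term, cat, categories))
```

In this toy corpus, "vote" appears only in one category, so its entropy is zero and the assumed score keeps the full coverage difference, while "goal" is spread across both categories and is discounted accordingly.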

  • Published in

    SAC '11: Proceedings of the 2011 ACM Symposium on Applied Computing
    March 2011
    1868 pages
    ISBN:9781450301138
    DOI:10.1145/1982185

    Copyright © 2011 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Acceptance Rates

    Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
