ABSTRACT
In text categorization, feature selection can be essential not only for reducing the index size but also for improving the performance of the classifier. In this article, we propose a feature selection criterion called Entropy based Category Coverage Difference (ECCD). This criterion is based on the distribution, across the categories, of the documents containing a term, and it also takes the term's entropy into account. On a large collection of XML documents from the Wikipedia encyclopedia, ECCD compares favorably with the usual feature selection methods based on document frequency (DF), information gain (IG), mutual information (MI), χ², odds ratio, and GSS. Moreover, this comparative study confirms the effectiveness of feature selection techniques derived from the χ² statistic.
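The abstract does not state the ECCD formula, so the following Python sketch is only an illustration of how a criterion of this kind could combine the two ingredients it names. It assumes, hypothetically, that the score for a term and a category is the difference between the fraction of in-category and out-of-category documents containing the term, multiplied by one minus the normalized entropy of the term's distribution over categories. The function name, the dictionary inputs, and the use of document counts for the entropy are all assumptions made for this example and may differ from the authors' definition.

```python
import math


def entropy_weighted_coverage(with_term, totals, category):
    """Hypothetical entropy-weighted coverage difference (not the paper's exact formula).

    with_term: dict mapping category -> number of documents containing the term
    totals:    dict mapping category -> number of documents in that category
               (assumed to have the same keys and no empty category)
    """
    n_cats = len(totals)
    occurrences = sum(with_term.values())
    if occurrences == 0 or n_cats < 2:
        return 0.0  # the term is absent, or there is nothing to discriminate

    # Coverage difference: P(term | category) - P(term | other categories).
    rest_with = sum(n for c, n in with_term.items() if c != category)
    rest_total = sum(n for c, n in totals.items() if c != category)
    p_in = with_term.get(category, 0) / totals[category]
    p_out = rest_with / rest_total if rest_total else 0.0
    coverage_diff = p_in - p_out

    # Shannon entropy of the term's distribution over categories (assumption:
    # computed from document counts). Low entropy means the term is concentrated.
    entropy = -sum(
        (n / occurrences) * math.log2(n / occurrences)
        for n in with_term.values()
        if n > 0
    )
    max_entropy = math.log2(n_cats)

    return coverage_diff * (max_entropy - entropy) / max_entropy


# Toy counts, invented for illustration only.
totals = {"sports": 10, "politics": 10}
concentrated = entropy_weighted_coverage({"sports": 8, "politics": 0}, totals, "sports")
spread = entropy_weighted_coverage({"sports": 5, "politics": 5}, totals, "sports")
print(concentrated, spread)
```

Under these assumptions, a term that occurs in only one category keeps its full coverage difference, while a term spread evenly over all categories scores zero regardless of how frequent it is.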