DOI: 10.1145/585058.585079
Article

Simple and accurate feature selection for hierarchical categorisation

Published: 08 November 2002

ABSTRACT

Categorisation of digital documents is useful for organisation and retrieval. While document categories can be a set of unstructured category labels, some document categories are hierarchically structured. This paper investigates automatic hierarchical categorisation and, specifically, the role of features in the development of more effective categorisers. We show that a good hierarchical machine learning-based categoriser can be developed using small numbers of features from pre-categorised training documents. Overall, we show that by using a few terms, categorisation accuracy can be improved substantially: unstructured leaf level categorisation can be improved by up to 8.6%, while top-down hierarchical categorisation accuracy can be improved by up to 12%. In addition, unlike other feature selection models --- which typically require different feature selection parameters for categories at different hierarchical levels --- our technique works equally well for all categories in a hierarchical structure. We conclude that, in general, more accurate hierarchical categorisation is possible by using our simple feature selection technique.
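
The abstract does not say which selection measure or learning algorithm the paper uses, so the following is only a toy sketch of the general idea: top-down hierarchical categorisation in which each node of the hierarchy keeps a small number of terms per child category. The hierarchy, the training documents, the scoring rule (term count in the category minus term count in its siblings), and the function names `select_features` and `classify` are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter

# Hypothetical toy hierarchy: each internal node maps to its child categories.
HIERARCHY = {
    "root": ["science", "sport"],
    "science": ["physics", "biology"],
    "sport": ["football", "tennis"],
}

# Hypothetical pre-categorised training documents, labelled with leaf categories.
TRAINING = [
    ("physics", "quantum energy particle energy field"),
    ("physics", "particle collider energy quantum"),
    ("biology", "cell gene protein cell organism"),
    ("biology", "gene genome protein species"),
    ("football", "goal striker league goal match"),
    ("football", "league match penalty striker"),
    ("tennis", "serve racket match set serve"),
    ("tennis", "racket set volley court"),
]


def leaves_under(node):
    """Return the leaf categories below a node (the node itself if it is a leaf)."""
    children = HIERARCHY.get(node)
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves_under(child)]


def select_features(node, k):
    """For each child of `node`, keep the k terms that score highest for it.

    The score is the term's count in the child's training documents minus its
    count in the sibling categories' documents. This is an assumed stand-in
    measure, chosen only because it is easy to follow.
    """
    counts = {}
    for child in HIERARCHY[node]:
        leaves = set(leaves_under(child))
        counts[child] = Counter(
            term
            for label, text in TRAINING
            if label in leaves
            for term in text.split()
        )
    features = {}
    for child, own in counts.items():
        others = Counter()
        for sibling, sibling_counts in counts.items():
            if sibling != child:
                others.update(sibling_counts)
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(own, key=lambda t: (-(own[t] - others[t]), t))
        features[child] = set(ranked[:k])
    return features


def classify(text, k=3):
    """Descend from the root, at each level choosing the child category whose
    selected terms overlap most with the document's terms."""
    terms = text.split()
    node = "root"
    while node in HIERARCHY:
        features = select_features(node, k)
        node = max(
            sorted(features),
            key=lambda child: sum(term in features[child] for term in terms),
        )
    return node


print(classify("the striker scored a goal in the league"))
print(classify("gene and protein in the cell"))
```

With only three terms per category, the two sample documents are routed first to the correct top-level category and then to the correct leaf. Note that the same `k` is used at every level of the toy hierarchy, which mirrors the abstract's point about not needing level-specific parameters but does not reproduce the paper's experiments.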


          Published in

          DocEng '02: Proceedings of the 2002 ACM symposium on Document engineering
          November 2002
          168 pages
          ISBN: 1581135947
          DOI: 10.1145/585058

          Copyright © 2002 ACM

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          DocEng '02 paper acceptance rate: 21 of 46 submissions, 46%. Overall acceptance rate: 178 of 537 submissions, 33%.
