DOI: 10.1145/345508.345593

Hierarchical classification of Web content

Published: 01 July 2000

ABSTRACT

This paper explores the use of hierarchical structure for classifying a large, heterogeneous collection of web content. The hierarchical structure is initially used to train different second-level classifiers. In the hierarchical case, a model is learned to distinguish a second-level category from other categories within the same top level. In the flat non-hierarchical case, a model distinguishes a second-level category from all other second-level categories. Scoring rules can further take advantage of the hierarchy by considering only second-level categories that exceed a threshold at the top level.
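The difference between the two training setups can be sketched as a choice of negative examples. The snippet below is a minimal illustration, not the authors' code: the toy taxonomy, the category names, and the helper `negatives_for` are all hypothetical, and the sketch only shows which second-level categories a model is trained to distinguish its target from in each case.

```python
from typing import Dict, List, Tuple

Category = Tuple[str, str]  # (top-level name, second-level name)

# Hypothetical toy taxonomy: top-level category -> its second-level categories.
TAXONOMY: Dict[str, List[str]] = {
    "Computers": ["Hardware", "Software"],
    "Sports": ["Soccer", "Tennis"],
}


def negatives_for(target: Category,
                  taxonomy: Dict[str, List[str]],
                  hierarchical: bool) -> List[Category]:
    """Return the second-level categories that act as negatives for `target`.

    hierarchical=True  -> only siblings under the same top-level category.
    hierarchical=False -> every other second-level category (flat case).
    """
    top, sub = target
    if hierarchical:
        return [(top, s) for s in taxonomy[top] if s != sub]
    return [(t, s)
            for t, subs in taxonomy.items()
            for s in subs
            if (t, s) != target]


if __name__ == "__main__":
    target = ("Computers", "Hardware")
    print(negatives_for(target, TAXONOMY, hierarchical=True))
    print(negatives_for(target, TAXONOMY, hierarchical=False))
```

In the hierarchical case each model only has to separate a category from its siblings, so its negative set is smaller and more topically focused than in the flat case.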

We use support vector machine (SVM) classifiers, which have been shown to be efficient and effective for classification, but not previously explored in the context of hierarchical classification. We found small advantages in accuracy for hierarchical models over flat models. For the hierarchical approach, we found the same accuracy using a sequential Boolean decision rule and a multiplicative decision rule. Since the sequential approach is much more efficient, requiring only 14%-16% of the comparisons used in the other approaches, we find it to be a good choice for classifying text into large hierarchical structures.
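As a rough illustration of why a sequential rule needs fewer comparisons, the sketch below contrasts one plausible reading of the two decision rules. The abstract names the rules but does not give their exact form, so the function names, the thresholds, and the scores here are assumptions: the sequential rule scores second-level categories only under top-level categories that pass a threshold, while the multiplicative rule scores every second-level category and thresholds the product of the two scores.

```python
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]
Assignment = List[Tuple[str, str]]


def sequential_boolean(top_scores: Scores,
                       score_children: Callable[[str], Scores],
                       t_top: float,
                       t_sub: float) -> Tuple[Assignment, int]:
    """Score children only under top-level categories whose score exceeds t_top."""
    assigned: Assignment = []
    sub_evals = 0  # number of second-level scores actually computed
    for top, p_top in top_scores.items():
        if p_top <= t_top:
            continue  # pruned: this branch's children are never scored
        children = score_children(top)
        sub_evals += len(children)
        assigned += [(top, sub) for sub, p in children.items() if p > t_sub]
    return assigned, sub_evals


def multiplicative(top_scores: Scores,
                   score_children: Callable[[str], Scores],
                   t: float) -> Tuple[Assignment, int]:
    """Score every child and keep those whose combined score exceeds t."""
    assigned: Assignment = []
    sub_evals = 0
    for top, p_top in top_scores.items():
        children = score_children(top)
        sub_evals += len(children)
        assigned += [(top, sub) for sub, p in children.items() if p_top * p > t]
    return assigned, sub_evals


# Hypothetical classifier outputs for a single document.
TOP = {"Computers": 0.9, "Sports": 0.1}
CHILD = {
    "Computers": {"Hardware": 0.8, "Software": 0.3},
    "Sports": {"Soccer": 0.6, "Tennis": 0.2},
}

if __name__ == "__main__":
    print(sequential_boolean(TOP, CHILD.__getitem__, t_top=0.5, t_sub=0.5))
    print(multiplicative(TOP, CHILD.__getitem__, t=0.5))
```

On this toy input both rules assign the same category, but the sequential rule scores two second-level categories instead of four. The saving grows with the number of top-level branches that can be pruned.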

Published in: SIGIR '00: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2000, 396 pages
ISBN: 1581132263
Proceedings DOI: 10.1145/345508

              Copyright © 2000 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Qualifiers: Article

Overall acceptance rate: 792 of 3,983 submissions (20%)
