ABSTRACT
This paper explores the use of hierarchical structure for classifying a large, heterogeneous collection of web content. The hierarchical structure is initially used to train different second-level classifiers. In the hierarchical case, a model is learned to distinguish a second-level category from other categories within the same top level. In the flat non-hierarchical case, a model distinguishes a second-level category from all other second-level categories. Scoring rules can further take advantage of the hierarchy by considering only second-level categories that exceed a threshold at the top level.
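As a minimal sketch of the difference between the two training setups, the Python below builds positive and negative example sets for one second-level classifier. The function name `training_sets`, the toy category labels, and the representation of documents as list indices are all illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple

# Hypothetical label format: (top-level category, second-level category).
Label = Tuple[str, str]


def training_sets(labels: List[Label], target: Label, hierarchical: bool):
    """Return (positive_indices, negative_indices) for one second-level model.

    hierarchical=True : negatives come only from the same top-level category.
    hierarchical=False: negatives are all other second-level categories (flat).
    """
    pos, neg = [], []
    for i, (top, sub) in enumerate(labels):
        if (top, sub) == target:
            pos.append(i)
        elif not hierarchical or top == target[0]:
            neg.append(i)
    return pos, neg


# Toy data (assumed for illustration only).
labels = [
    ("Computers", "Hardware"),
    ("Computers", "Software"),
    ("Sports", "Soccer"),
    ("Computers", "Hardware"),
]
target = ("Computers", "Hardware")

hier = training_sets(labels, target, hierarchical=True)
flat = training_sets(labels, target, hierarchical=False)
```

In the hierarchical setting the "Sports" document is left out of the negatives, so the model only has to separate "Hardware" from its siblings under "Computers".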
We use support vector machine (SVM) classifiers, which have been shown to be efficient and effective for classification but have not previously been explored in the context of hierarchical classification. We found small advantages in accuracy for hierarchical models over flat models. For the hierarchical approach, we found the same accuracy using a sequential Boolean decision rule and a multiplicative decision rule. Since the sequential approach is much more efficient, requiring only 14%-16% of the comparisons used in the other approaches, we find it to be a good choice for classifying text into large hierarchical structures.
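The two decision rules can be sketched as follows, under stated assumptions. The abstract does not give their exact form, so this sketch assumes each classifier outputs a probability-like score, that the sequential Boolean rule evaluates second-level models only under top-level categories whose score passes a threshold, and that the multiplicative rule thresholds the product of the top-level and second-level scores. The function names, thresholds, and scores are hypothetical, and the toy counts are not meant to reproduce the 14%-16% figure.

```python
def sequential_boolean(top_scores, sub_scores, top_thresh, sub_thresh):
    """Evaluate second-level models only under top-level categories that pass."""
    assigned, evaluations = [], 0
    for top, p_top in top_scores.items():
        evaluations += 1                  # one top-level model evaluated
        if p_top < top_thresh:
            continue                      # prune every child of this category
        for sub, p_sub in sub_scores[top].items():
            evaluations += 1              # one second-level model evaluated
            if p_sub >= sub_thresh:
                assigned.append((top, sub))
    return assigned, evaluations


def multiplicative(top_scores, sub_scores, thresh):
    """Score every second-level category as P(top) * P(sub); no pruning."""
    assigned, evaluations = [], 0
    for top, p_top in top_scores.items():
        evaluations += 1
        for sub, p_sub in sub_scores[top].items():
            evaluations += 1
            if p_top * p_sub >= thresh:
                assigned.append((top, sub))
    return assigned, evaluations


# Toy scores for a single document (assumed for illustration only).
top_scores = {"Computers": 0.9, "Sports": 0.1}
sub_scores = {
    "Computers": {"Hardware": 0.8, "Software": 0.3},
    "Sports": {"Soccer": 0.7, "Tennis": 0.2},
}

seq = sequential_boolean(top_scores, sub_scores, top_thresh=0.5, sub_thresh=0.5)
mult = multiplicative(top_scores, sub_scores, thresh=0.5)
```

On this toy input both rules assign the same category, but the sequential rule skips the two "Sports" models entirely. The saving grows with the number of second-level categories under each pruned top-level category.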
Index Terms
- Hierarchical classification of Web content