Hierarchical classification of Web content

Authors:
Susan Dumais

Microsoft Research, One Microsoft Way, Redmond, WA

Hao Chen

Computer Science Division, University of California at Berkeley, Berkeley, CA


SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, July 2000, Pages 256–263, https://doi.org/10.1145/345508.345593

Published: 01 July 2000

ABSTRACT

This paper explores the use of hierarchical structure for classifying a large, heterogeneous collection of web content. The hierarchical structure is initially used to train different second-level classifiers. In the hierarchical case, a model is learned to distinguish a second-level category from other categories within the same top level. In the flat non-hierarchical case, a model distinguishes a second-level category from all other second-level categories. Scoring rules can further take advantage of the hierarchy by considering only second-level categories that exceed a threshold at the top level.
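The difference between the hierarchical and flat training setups comes down to which categories supply the negative examples for a given second-level model. The following is a minimal sketch of that selection, not the authors' code; the taxonomy, the category names, and the function `negative_categories` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy taxonomy: top-level category -> its second-level categories.
TAXONOMY: Dict[str, List[str]] = {
    "Computers": ["Hardware", "Software"],
    "Sports": ["Soccer", "Tennis"],
}

Category = Tuple[str, str]  # (top-level name, second-level name)


def negative_categories(target: Category, hierarchical: bool) -> List[Category]:
    """Return the second-level categories that act as negatives for `target`."""
    top, second = target
    all_second = [(t, s) for t, subs in TAXONOMY.items() for s in subs]
    if hierarchical:
        # Hierarchical case: only siblings under the same top-level category.
        return [(t, s) for (t, s) in all_second if t == top and s != second]
    # Flat case: every other second-level category, whatever its parent.
    return [c for c in all_second if c != target]
```

In this toy setup, the hierarchical model for `("Computers", "Hardware")` is trained against one sibling, while the flat model is trained against all three remaining second-level categories.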

We use support vector machine (SVM) classifiers, which have been shown to be efficient and effective for classification, but not previously explored in the context of hierarchical classification. We found small advantages in accuracy for hierarchical models over flat models. For the hierarchical approach, we found the same accuracy using a sequential Boolean decision rule and a multiplicative decision rule. Since the sequential approach is much more efficient, requiring only 14%-16% of the comparisons used in the other approaches, we find it to be a good choice for classifying text into large hierarchical structures.
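The two decision rules named above can be sketched as follows. The abstract does not give their exact form, so this sketch assumes that the sequential Boolean rule requires both the top-level and the second-level score to exceed a threshold, and that the multiplicative rule thresholds the product of the two scores. The function names, scores, thresholds, and the way model evaluations are counted are all hypothetical.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, float]
Result = Tuple[List[Tuple[str, str]], int]

# Hypothetical classifier outputs for one document (probability-like scores).
TOP_SCORES: Scores = {"Computers": 0.9, "Sports": 0.1}
SECOND_SCORES: Dict[str, Scores] = {
    "Computers": {"Hardware": 0.8, "Software": 0.3},
    "Sports": {"Soccer": 0.7, "Tennis": 0.2},
}


def sequential_boolean(top: Scores, second: Dict[str, Scores],
                       t_top: float, t_second: float) -> Result:
    """Assumed rule: accept when the top-level AND second-level scores pass."""
    accepted, evaluated = [], 0
    for top_name, p_top in top.items():
        evaluated += 1  # one top-level model evaluated
        if p_top <= t_top:
            continue  # children of a rejected top level are never scored
        for sub_name, p_sub in second[top_name].items():
            evaluated += 1  # one second-level model evaluated
            if p_sub > t_second:
                accepted.append((top_name, sub_name))
    return accepted, evaluated


def multiplicative(top: Scores, second: Dict[str, Scores], t: float) -> Result:
    """Assumed rule: accept when the product of the two scores passes."""
    accepted = []
    evaluated = len(top)  # every top-level model is evaluated
    for top_name, p_top in top.items():
        for sub_name, p_sub in second[top_name].items():
            evaluated += 1  # every second-level model is evaluated too
            if p_top * p_sub > t:
                accepted.append((top_name, sub_name))
    return accepted, evaluated
```

On this toy input both rules accept the same category with thresholds of 0.5, but the sequential rule evaluates fewer models because it prunes every branch whose top-level score fails. The counts here illustrate the pruning only and are not meant to reproduce the 14%-16% figure reported above.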

Index Terms

Hierarchical classification of Web content
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
  2. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
2. Information systems
  1. Information retrieval
  2. Information storage systems

Published in
SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
July 2000
396 pages
ISBN:1581132263
DOI:10.1145/345508
Chairmen:
Emmanuel Yannakoudakis, Athens Univ. of Economics and Business, Greece
Nicholas J. Belkin, Rutgers Univ.
Mun-Kew Leong, Kent Ridge Digital Labs
Peter Ingwersen, Royal School of Library and Information Science
Copyright © 2000 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Web hierarchies
classification
hierarchical models
machine learning
support vector machines
text categorization
text classification
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%