Published in: Discover Computing 4/2008

01-08-2008

Boosting multi-label hierarchical text categorization

Authors: Andrea Esuli, Tiziano Fagni, Fabrizio Sebastiani

Abstract

Hierarchical Text Categorization (HTC) is the task of generating (usually by means of supervised learning algorithms) text classifiers that operate on hierarchically structured classification schemes. Notwithstanding the fact that most large-sized classification schemes for text have a hierarchical structure, so far the attention of text classification researchers has mostly focused on algorithms for “flat” classification, i.e. algorithms that operate on non-hierarchical classification schemes. These algorithms, once applied to a hierarchical classification problem, are not capable of taking advantage of the information inherent in the class hierarchy, and may thus be suboptimal in terms of efficiency and/or effectiveness. In this paper we propose TreeBoost.MH, a multi-label HTC algorithm consisting of a hierarchical variant of AdaBoost.MH, a very well-known member of the family of “boosting” learning algorithms. TreeBoost.MH embodies several intuitions that had arisen before within HTC: e.g. the intuitions that both feature selection and the selection of negative training examples should be performed “locally”, i.e. by paying attention to the topology of the classification scheme. It also embodies the novel intuition that the weight distribution that boosting algorithms update at every boosting round should likewise be updated “locally”. All these intuitions are embodied within TreeBoost.MH in an elegant and simple way, i.e. by defining TreeBoost.MH as a recursive algorithm that uses AdaBoost.MH as its base step, and that recurs over the tree structure. We present the results of experimenting with TreeBoost.MH on three HTC benchmarks, and discuss its computational cost analytically.
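The recursive structure described in the abstract can be sketched as follows. This is only a structural sketch under assumed data shapes (documents as (feature-set, label-set) pairs, tree nodes as nested dicts), not the authors' implementation: the function train_local, which in TreeBoost.MH would be AdaBoost.MH trained on the children of a node with locally selected features and locally chosen negative examples, is replaced here by a trivial feature-overlap classifier just to make the sketch executable.

```python
# Structural sketch of TreeBoost.MH's recursion (hypothetical code, not the
# authors' implementation). A document is a (feature_set, label_set) pair;
# a tree node is {"children": {category_name: subtree, ...}}.

def train_local(docs, child_labels):
    # Stand-in for AdaBoost.MH restricted to the children of one node.
    # Builds one "centroid" feature set per child category.
    centroids = {c: set() for c in child_labels}
    for feats, labels in docs:
        for c in labels & set(child_labels):
            centroids[c] |= feats

    def predict(feats):
        # A document is assigned every child category whose centroid it overlaps.
        return {c for c, cent in centroids.items() if feats & cent}

    return predict

def train_tree(node, docs):
    # Recursive step: train a local multi-label classifier on this node's
    # children, then recurse into each child using only the documents that
    # belong to it ("local" selection of training examples).
    children = node["children"]
    if not children:
        return None  # leaf: nothing to discriminate
    local = train_local(docs, list(children))
    sub = {}
    for name, subtree in children.items():
        sub_docs = [(f, l) for (f, l) in docs if name in l]
        sub[name] = train_tree(subtree, sub_docs)
    return {"local": local, "sub": sub}

def classify(clf, feats):
    # A document is routed down every branch whose local classifier accepts it,
    # collecting all category labels along the way (multi-label output).
    if clf is None:
        return set()
    out = set()
    for c in clf["local"](feats):
        out.add(c)
        out |= classify(clf["sub"][c], feats)
    return out
```

The point of the sketch is the control flow: feature statistics, negative examples, and (in the real algorithm) the boosting weight distribution are all computed per node, over that node's children only, rather than once globally as in flat AdaBoost.MH.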


Footnotes
4
This is a result of the fact that the very first publicly available TC benchmarks (e.g. the Reuters-21578 benchmark; see Sect. 5.1) had no hierarchical structure; the TC literature thus ended up, at least to some degree, overfitting the available benchmarks.
 
5
Consistently with most mathematical literature we use the caret symbol (\(\hat{\,}\)) to indicate estimation. In fact, a classifier \(\hat\Upphi\) can be understood as an estimation, or approximation, of an unknown “target function” Φ.
 
6
In Schapire and Singer (2000) the value for ε is chosen by 3-fold cross validation on the training set, but this procedure is reported to give only marginal improvements with respect to the default choice of \(\epsilon=\frac{1}{gm},\) which we adopt in this work.
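For concreteness, this is the smoothing of the confidence-rated predictions introduced by Schapire and Singer (1999): a weak hypothesis's real-valued output for a given block is computed from the total weights \(W_{+}\) and \(W_{-}\) of the positive and negative examples falling in that block as

\[ c = \frac{1}{2}\ln\left(\frac{W_{+}+\epsilon}{W_{-}+\epsilon}\right) \]

so that the default \(\epsilon=\frac{1}{gm}\) keeps the prediction bounded when either weight is zero.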
 
7
Note that a local policy would also implement this view, but it is not possible with AdaBoost.MH, since the latter uses the same set of features for all the categories involved in the ML classification problem. This means that we need to use the same set of features for all categories in \(\downarrow({c_{j}}).\)
 
8
Our analysis is here in terms of the S original weak hypotheses rather than in terms of the \(\Updelta\) combined weak hypotheses; this is because different internal nodes have different values of \(\Updelta,\) which would needlessly complicate the notation and the discussion; the conclusions are not affected anyway.
 
9
Reuters-21578 is freely available for experimentation purposes from http://www.daviddlewis.com/resources/testcollections/reuters21578/.
 
13
Note that many real-world classification schemes (e.g. the ACM Classification Scheme) are of this latter type, since their internal nodes usually have a special child category (called \({\mathsf{General}},\) or \({\mathsf{Other}}\)) which contains all documents belonging to the node but to none of its descendant leaves.
 
14
The reader might notice that the best performance we have obtained from AdaBoost.MH on Reuters-21578 (\(F_{1}^{\mu}=.808\)) is inferior to the one reported in Schapire and Singer (2000) for the same algorithm (\(F_{1}^{\mu}=.851\)). There are several reasons for this: (a) Schapire and Singer (2000) actually uses a different, much older version of this collection, called Reuters-21450 (Apté et al. 1994); (b) Schapire and Singer (2000) only uses the 93 categories which have at least 2 positive training examples and 1 positive test example, while we also use the categories that have just 1 positive training example and those that have no positive test example. This makes the two sets of AdaBoost.MH results difficult to compare.
 
Literature
Apté, C., Damerau, F. J., & Weiss, S. M. (1994). Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12(3), 233–251.
Cai, L., & Hofmann, T. (2004). Hierarchical document categorization with support vector machines. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management (CIKM’04) (pp. 78–87).
Ceci, M., & Malerba, D. (2007). Classifying Web documents in a hierarchy of categories: A comprehensive study. Journal of Intelligent Information Systems, 28(1), 37–78.
Chakrabarti, S., Dom, B. E., Agrawal, R., & Raghavan, P. (1998). Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. Journal of Very Large Data Bases, 7(3), 163–178.
Cheng, C.-H., Tang, J., Wai-Chee, A., & King, I. (2001). Hierarchical classification of documents with error control. In Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’01) (pp. 433–443). Hong Kong, CN.
Dumais, S. T., & Chen, H. (2000). Hierarchical classification of web content. In Proceedings of the 23rd ACM International Conference on Research and Development in Information Retrieval (SIGIR’00) (pp. 256–263). Athens, GR.
Fagni, T., & Sebastiani, F. (2007). On the selection of negative examples for hierarchical text categorization. In Proceedings of the 3rd Language & Technology Conference (LTC’07) (pp. 24–28). Poznań, PL.
Forman, G. (2004). A pitfall and solution in multi-class feature selection for text classification. In Proceedings of the 21st International Conference on Machine Learning (ICML’04). Banff, CA.
Gaussier, É., Goutte, C., Popat, K., & Chen, F. (2002). A hierarchical model for clustering and categorising documents. In Proceedings of the 24th European Colloquium on Information Retrieval Research (ECIR’02) (pp. 229–247). Glasgow, UK.
Koller, D., & Sahami, M. (1997). Hierarchically classifying documents using very few words. In Proceedings of the 14th International Conference on Machine Learning (ICML’97) (pp. 170–178). Nashville, US.
Lewis, D. D. (1992). Representation and learning in information retrieval. PhD thesis, Department of Computer Science, University of Massachusetts, Amherst, US.
Lewis, D. D., Li, F., Rose, T., & Yang, Y. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361–397.
Liu, T. Y., Yang, Y., Wan, H., Zeng, H. J., Chen, Z., & Ma, W. Y. (2005). Support vector machines classification with a very large-scale taxonomy. SIGKDD Explorations, 7(1), 36–43.
McCallum, A. K., Rosenfeld, R., Mitchell, T. M., & Ng, A. Y. (1998). Improving text classification by shrinkage in a hierarchy of classes. In Proceedings of the 15th International Conference on Machine Learning (ICML’98) (pp. 359–367). Madison, US.
Ng, H. T., Goh, W. B., & Low, K. L. (1997). Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of the 20th ACM International Conference on Research and Development in Information Retrieval (SIGIR’97) (pp. 67–73). Philadelphia, US.
Rose, T., Stevenson, M., & Whitehead, M. (2002). The Reuters Corpus Volume 1—from yesterday’s news to tomorrow’s language resources. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC’02) (pp. 827–832). Las Palmas, ES.
Ruiz, M., & Srinivasan, P. (2002). Hierarchical text classification using neural networks. Information Retrieval, 5(1), 87–118.
Schapire, R. E., & Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3), 297–336.
Schapire, R. E., & Singer, Y. (2000). BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3), 135–168.
Schapire, R. E., Singer, Y., & Singhal, A. (1998). Boosting and Rocchio applied to text filtering. In Proceedings of the 21st ACM International Conference on Research and Development in Information Retrieval (SIGIR’98) (pp. 215–223). Melbourne, AU.
Sebastiani, F., Sperduti, A., & Valdambrini, N. (2000). An improved boosting algorithm and its application to automated text categorization. In Proceedings of the 9th ACM International Conference on Information and Knowledge Management (CIKM’00) (pp. 78–85). McLean, US.
Spiegel, M. R., & Stephens, L. J. (1999). Statistics (3rd ed.). New York, US: McGraw-Hill.
Sun, A., & Lim, E.-P. (2001). Hierarchical text classification and evaluation. In Proceedings of the 1st IEEE International Conference on Data Mining (ICDM’01) (pp. 521–528). San Jose, US.
Toutanova, K., Chen, F., Popat, K., & Hofmann, T. (2001). Text classification in a hierarchical mixture model for small training sets. In Proceedings of the 10th ACM International Conference on Information and Knowledge Management (CIKM’01) (pp. 105–113). Atlanta, US.
Vinokourov, A., & Girolami, M. (2002). A probabilistic framework for the hierarchic organisation and classification of document collections. Journal of Intelligent Information Systems, 18(2/3), 153–172.
Weigend, A. S., Wiener, E. D., & Pedersen, J. O. (1999). Exploiting hierarchy in text categorization. Information Retrieval, 1(3), 193–216.
Wiener, E. D., Pedersen, J. O., & Weigend, A. S. (1995). A neural network approach to topic spotting. In Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval (SDAIR’95) (pp. 317–332). Las Vegas, US.
Yang, Y., & Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of the 22nd ACM International Conference on Research and Development in Information Retrieval (SIGIR’99) (pp. 42–49). Berkeley, US.
Yang, Y., Zhang, J., & Kisiel, B. (2003). A scalability analysis of classifiers in text categorization. In Proceedings of the 26th ACM International Conference on Research and Development in Information Retrieval (SIGIR’03) (pp. 96–103). Toronto, CA.
Metadata
Title
Boosting multi-label hierarchical text categorization
Authors
Andrea Esuli
Tiziano Fagni
Fabrizio Sebastiani
Publication date
01-08-2008
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 4/2008
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-008-9047-y
