research-article

Refined experts: improving classification in large taxonomies

Authors:
Paul N. Bennett, Microsoft Research, Redmond, WA, USA
Nam Nguyen, Cornell University, Ithaca, NY, USA

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
July 2009
Pages 11–18
https://doi.org/10.1145/1571941.1571946

Published: 19 July 2009

ABSTRACT

Large-scale taxonomies, especially for web pages, have existed for some time, yet approaches that automatically classify documents into them have met with limited success compared to the more general progress made in text classification. We argue that this stems from three causes: increasing sparsity of training data at deeper nodes in the taxonomy; error propagation, where a mistake made high in the hierarchy cannot be recovered; and increasingly complex decision surfaces at higher nodes in the hierarchy. While prior research has focused on the first problem, we introduce methods that target the latter two: first, biasing the training distribution to reduce error propagation, and second, propagating "first-guess" expert information bottom-up before making a refined top-down choice. Finally, we present an empirical study demonstrating that the suggested changes lead to 10-30% improvements in F1 scores over an accepted competitive baseline, hierarchical SVMs.
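
To make the error-propagation problem concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical three-level toy taxonomy, hand-picked "first-guess" scores, and a simple averaging rule as a stand-in for how lower-level information might inform a higher-level decision; the abstract does not specify the actual combination method. The names `TAXONOMY`, `top_down`, and `propagate_up` are illustrative only.

```python
# Toy taxonomy (hypothetical): each internal node maps to its children.
TAXONOMY = {
    "root": ["science", "sports"],
    "science": ["physics", "biology"],
    "sports": ["soccer", "tennis"],
}


def top_down(scores, node="root"):
    """Greedy top-down routing: pick the best-scoring child at each level."""
    while node in TAXONOMY:
        node = max(TAXONOMY[node], key=lambda child: scores[child])
    return node


def propagate_up(scores, node="root"):
    """Return refined scores for the subtree under `node`.

    Assumption for illustration: an internal node's refined score is the
    average of its own first-guess score and the best refined score among
    its children. Leaves keep their first-guess scores.
    """
    refined = {}

    def visit(n):
        if n not in TAXONOMY:  # leaf node
            refined[n] = scores[n]
        else:
            best_child = max(visit(c) for c in TAXONOMY[n])
            refined[n] = 0.5 * (scores.get(n, 0.0) + best_child)
        return refined[n]

    visit(node)
    return refined


# Hand-picked first-guess scores for one document (illustrative values).
# The top level slightly prefers "sports", but the "physics" leaf is confident.
first_guess = {
    "science": 0.45, "sports": 0.55,
    "physics": 0.90, "biology": 0.20,
    "soccer": 0.30, "tennis": 0.25,
}

baseline = top_down(first_guess)               # wrong turn at the top is never undone
refined = top_down(propagate_up(first_guess))  # leaf evidence shifts the top-level choice
print(baseline, refined)
```

In this toy case the plain top-down pass commits to the wrong branch at the first level and cannot recover, while the pass over bottom-up refined scores reaches the confident leaf. The averaging rule is only one of many possible ways to combine the scores and is chosen here for brevity.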

Index Terms

Refined experts: improving classification in large taxonomies
1. Applied computing
  1. Document management and text processing
2. Information systems
  1. Information systems applications

Published in
SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
July 2009
896 pages
ISBN: 9781605584836
DOI: 10.1145/1571941
General Chairs: James Allan (University of Massachusetts Amherst, USA), Javed Aslam (Northeastern University, USA)
Program Chairs: Mark Sanderson (University of Sheffield, UK), ChengXiang Zhai (University of Illinois at Urbana-Champaign, USA), Justin Zobel (University of Melbourne, Australia)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
large-scale hierarchy
text classification
Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%