ABSTRACT
Although large-scale taxonomies, especially for web pages, have existed for some time, approaches that automatically classify documents into these taxonomies have met with limited success compared with the broader progress made in text classification. We argue that this stems from three causes: increasing sparsity of training data at deeper nodes in the taxonomy; error propagation, where a mistake made high in the hierarchy cannot be recovered; and increasingly complex decision surfaces at higher nodes in the hierarchy. While prior research has focused on the first problem, we introduce methods that target the latter two: first, biasing the training distribution to reduce error propagation, and second, propagating "first-guess" expert information up the hierarchy in a bottom-up manner before making a refined top-down choice. Finally, we present an empirical study demonstrating that the suggested changes lead to 10-30% improvements in F1 scores over an accepted competitive baseline, hierarchical SVMs.
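The error-propagation problem named above can be illustrated with a small sketch. This is not the paper's method or data: the taxonomy, the keyword sets, and the functions `score`, `classify_top_down`, and `classify_flat` are all hypothetical stand-ins for trained per-node classifiers. It only shows how a greedy top-down pass cannot recover from a wrong choice at the top level, even when a leaf-level scorer would have picked the right class.

```python
# Toy taxonomy: each internal node maps to its children (hypothetical names).
TAXONOMY = {
    "root": ["Science", "Sports"],
    "Science": ["Physics", "Biology"],
    "Sports": ["Soccer", "Tennis"],
}

# Hypothetical keyword sets standing in for trained per-node classifiers.
KEYWORDS = {
    "Science": {"energy", "cell"},
    "Sports": {"match", "ball", "field"},
    "Physics": {"energy", "field", "particle"},
    "Biology": {"cell", "gene"},
    "Soccer": {"ball", "goal"},
    "Tennis": {"racket", "serve"},
}


def score(node, tokens):
    """Count how many of the node's keywords appear in the document."""
    return len(KEYWORDS[node] & tokens)


def classify_top_down(text):
    """Greedy top-down routing: commit to the best child at every level."""
    tokens = set(text.lower().split())
    node, path = "root", []
    while node in TAXONOMY:
        # Ties are broken in favor of the first child listed.
        node = max(TAXONOMY[node], key=lambda child: score(child, tokens))
        path.append(node)
    return path


def classify_flat(text):
    """Score every leaf directly, ignoring the hierarchy."""
    tokens = set(text.lower().split())
    children = {c for kids in TAXONOMY.values() for c in kids}
    leaves = sorted(children - set(TAXONOMY))
    return max(leaves, key=lambda leaf: score(leaf, tokens))


doc = "the particle moves through the field"
print(classify_top_down(doc))  # routed into Sports at the top, so Physics is unreachable
print(classify_flat(doc))      # the leaf-level scorer prefers Physics
```

In this toy case the word "field" sends the document to Sports at the first level, after which no lower-level decision can reach Physics. The leaf-level score is the kind of signal that a bottom-up "first-guess" pass could make available to the higher nodes, though how the paper actually uses it is not specified in the abstract.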
Index Terms
- Refined experts: improving classification in large taxonomies