ABSTRACT
Categorisation of digital documents is useful for organisation and retrieval. Document categories may be a flat set of unstructured labels, but some are organised hierarchically. This paper investigates automatic hierarchical categorisation and, specifically, the role of features in the development of more effective categorisers. We show that a good hierarchical machine learning-based categoriser can be developed using a small number of features from pre-categorised training documents. Overall, we show that using only a few terms improves categorisation accuracy substantially: unstructured leaf-level categorisation improves by up to 8.6%, while top-down hierarchical categorisation improves by up to 12%. In addition, unlike other feature selection models, which typically require different feature selection parameters for categories at different hierarchical levels, our technique works equally well for all categories in a hierarchical structure. We conclude that, in general, more accurate hierarchical categorisation is possible by using our simple feature selection technique.
Index Terms
- Simple and accurate feature selection for hierarchical categorisation