
Simple and accurate feature selection for hierarchical categorisation

Authors:
Wahyu Wibowo, RMIT University, Melbourne, Australia
Hugh E. Williams, RMIT University, Melbourne, Australia

DocEng '02: Proceedings of the 2002 ACM symposium on Document engineering, November 2002, Pages 111–118
https://doi.org/10.1145/585058.585079

Published: 08 November 2002

ABSTRACT

Categorisation of digital documents is useful for organisation and retrieval. While document categories can be a set of unstructured category labels, some document categories are hierarchically structured. This paper investigates automatic hierarchical categorisation and, specifically, the role of features in the development of more effective categorisers. We show that a good hierarchical machine learning-based categoriser can be developed using small numbers of features from pre-categorised training documents. Overall, we show that by using a few terms, categorisation accuracy can be improved substantially: unstructured leaf level categorisation can be improved by up to 8.6%, while top-down hierarchical categorisation accuracy can be improved by up to 12%. In addition, unlike other feature selection models (which typically require different feature selection parameters for categories at different hierarchical levels), our technique works equally well for all categories in a hierarchical structure. We conclude that, in general, more accurate hierarchical categorisation is possible by using our simple feature selection technique.
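
As an illustration only, the general idea of top-down hierarchical categorisation with a small number of features per category might be sketched as follows. The abstract does not state the selection criterion or the learning algorithm, so everything here is an assumption: the toy hierarchy, the training snippets, the function names, and the scoring rule (term frequency in a category minus term frequency in its siblings) are hypothetical and are not the authors' method.

```python
from collections import Counter

# Hypothetical toy hierarchy (parent -> children); not from the paper.
HIERARCHY = {
    "root": ["science", "sport"],
    "science": ["physics", "biology"],
    "sport": ["football", "tennis"],
}

# Hypothetical pre-categorised training documents, attached to leaf categories.
TRAINING = {
    "physics": ["quantum energy particle field", "particle energy mass quantum"],
    "biology": ["cell gene protein organism", "gene cell species protein"],
    "football": ["goal team league striker", "team goal match league"],
    "tennis": ["serve racket court set", "racket serve match court"],
}


def leaves_under(node):
    """Return all leaf categories below (or equal to) a node."""
    children = HIERARCHY.get(node)
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves_under(child)]


def term_counts(node):
    """Count term occurrences over all training documents under a node."""
    counts = Counter()
    for leaf in leaves_under(node):
        for doc in TRAINING[leaf]:
            counts.update(doc.split())
    return counts


def select_features(node, siblings, k):
    """Keep the k terms scoring highest for a node relative to its siblings.

    Assumed criterion: frequency under the node minus frequency under the
    siblings, with ties broken alphabetically so the result is deterministic.
    """
    own = term_counts(node)
    other = Counter()
    for sibling in siblings:
        if sibling != node:
            other.update(term_counts(sibling))
    scored = {term: own[term] - other[term] for term in own}
    ranked = sorted(scored, key=lambda term: (-scored[term], term))
    return set(ranked[:k])


def build_models(k=3):
    """Select k features for every non-root category, using the same k at all levels."""
    models = {}
    for children in HIERARCHY.values():
        for child in children:
            models[child] = select_features(child, children, k)
    return models


def classify_top_down(text, models):
    """Descend from the root, picking at each level the child with most feature hits.

    On a tie (including zero hits) the first listed child is chosen.
    """
    tokens = text.split()
    node, path = "root", []
    while node in HIERARCHY:
        node = max(
            HIERARCHY[node],
            key=lambda child: sum(1 for tok in tokens if tok in models[child]),
        )
        path.append(node)
    return path


if __name__ == "__main__":
    models = build_models(k=3)
    print(classify_top_down("the gene and protein of a cell", models))
    # prints ['science', 'biology']
```

The sketch uses the same feature count k at every level of the hierarchy, which loosely mirrors the abstract's claim that one setting can serve all categories; a real categoriser would replace the overlap count with a trained classifier.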

Index Terms

1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
    2. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction

Published in
DocEng '02: Proceedings of the 2002 ACM symposium on Document engineering
November 2002
168 pages
ISBN: 1581135947
DOI: 10.1145/585058
General Chair: Ethan Munson, University of Wisconsin-Milwaukee, USA
Program Chairs: Richard Furuta, Texas A&M University, USA; Jonathan I. Maletic, Kent State University, USA
Copyright © 2002 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
categorisation
error reduction
hierarchical categorisation
web hierarchies
Qualifiers
- Article
Conference

Acceptance Rates
DocEng '02 paper acceptance rate: 21 of 46 submissions, 46%. Overall acceptance rate: 178 of 537 submissions, 33%.