Parameterized generation of labeled datasets for text categorization based on a hierarchical directory

Authors:
Dmitry Davidov, Technion, Haifa, Israel
Evgeniy Gabrilovich, Technion, Haifa, Israel
Shaul Markovitch, Technion, Haifa, Israel

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, July 2004, Pages 250–257. https://doi.org/10.1145/1008992.1009036

Published: 25 July 2004

ABSTRACT

Although text categorization is a burgeoning area of IR research, readily available test collections in this field are surprisingly scarce. We describe a methodology and system (named ACCIO) for automatically acquiring labeled datasets for text categorization from the World Wide Web, by capitalizing on the body of knowledge encoded in the structure of existing hierarchical directories such as the Open Directory. We define parameters of categories that make it possible to acquire numerous datasets with desired properties, which in turn allow better control over categorization experiments. In particular, we develop metrics that estimate the difficulty of a dataset by examining the host directory structure. These metrics are shown to be good predictors of the categorization accuracy that can be achieved on a dataset, and serve as efficient heuristics for generating datasets subject to the user's requirements. A large collection of automatically generated datasets is made available for other researchers to use.
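
The abstract does not spell out how the difficulty metrics are computed. As a minimal sketch, assuming one such metric is simply the number of edges separating two category nodes in the directory tree, the idea of "examining the host directory structure" might look like the following. The function names and the slash-separated category paths are hypothetical illustrations, not taken from the paper or from ACCIO.

```python
from typing import List


def path_to_root(category: str) -> List[str]:
    """Expand a slash-separated category path into its chain of ancestors,
    root first (the path format is an assumption made for this sketch)."""
    parts = category.strip("/").split("/")
    return ["/".join(parts[: i + 1]) for i in range(len(parts))]


def graph_distance(cat_a: str, cat_b: str) -> int:
    """Count the edges on the tree path between two category nodes.

    Illustrative proxy only: categories that sit close together in the
    hierarchy are presumably harder to tell apart than distant ones.
    Paths with no shared root are treated as hanging off a virtual root.
    """
    a, b = path_to_root(cat_a), path_to_root(cat_b)
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Edges from each node up to the deepest common ancestor.
    return (len(a) - common) + (len(b) - common)


if __name__ == "__main__":
    # Sibling categories: one edge up, one edge down.
    print(graph_distance("Top/Science/Physics", "Top/Science/Chemistry"))  # 2
    # Categories in different top-level branches.
    print(graph_distance("Top/Arts/Music", "Top/Science/Physics/Optics"))  # 5
```

Under this assumption, a pair of categories with a small distance would be a candidate for a "hard" dataset and a pair with a large distance for an "easy" one; whether the paper's metrics behave this way is for the full text to confirm.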

Index Terms

1. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results
    2. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction

Published in
SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
July 2004
624 pages
ISBN: 1581138814
DOI: 10.1145/1008992
General Chair: Mark Sanderson, University of Sheffield (UK)
Program Chairs: Kalervo Järvelin, University of Tampere (Finland); James Allan, University of Massachusetts (USA); Peter Bruza, Distributed Systems Technology Centre (Australia)
Copyright © 2004 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%