ABSTRACT
Although text categorization is a burgeoning area of IR research, readily available test collections in this field are surprisingly scarce. We describe a methodology and system (named ACCIO) for automatically acquiring labeled datasets for text categorization from the World Wide Web, by capitalizing on the body of knowledge encoded in the structure of existing hierarchical directories such as the Open Directory. We define parameters of categories that make it possible to acquire numerous datasets with desired properties, which in turn allows better control over categorization experiments. In particular, we develop metrics that estimate the difficulty of a dataset by examining the host directory structure. These metrics are shown to be good predictors of the categorization accuracy achievable on a dataset, and they serve as efficient heuristics for generating datasets subject to user requirements. A large collection of automatically generated datasets is made available for other researchers to use.
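The abstract does not spell out the difficulty metrics, so the following is only an illustrative sketch of what a structure-based measure might look like, assuming categories are identified by slash-separated directory paths. The function names and the example paths are hypothetical and are not taken from ACCIO. The sketch counts the edges between two categories through their deepest common ancestor, on the intuition that categories lying close together in the hierarchy are likely to be harder to tell apart.

```python
def category_path(path: str) -> list[str]:
    """Split a slash-separated category path into its components."""
    return [part for part in path.strip("/").split("/") if part]


def edge_distance(cat_a: str, cat_b: str) -> int:
    """Number of hierarchy edges between two categories.

    Assumption: the directory is a tree and each category is named by its
    full path from the root, so the distance is the number of steps up from
    each category to their deepest common ancestor.
    """
    a, b = category_path(cat_a), category_path(cat_b)
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


if __name__ == "__main__":
    # Hypothetical category paths, used only for illustration.
    print(edge_distance("Top/Arts/Music", "Top/Arts/Movies"))
    print(edge_distance("Top/Arts/Music/Jazz", "Top/Science/Physics"))
```

Under this assumed measure, a pair of sibling categories gets a small distance and a pair drawn from different top-level branches gets a larger one, which is one way a dataset generator could be steered toward easier or harder category pairs.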