DOI: 10.1145/1008992.1009036
Article

Parameterized generation of labeled datasets for text categorization based on a hierarchical directory

Published: 25 July 2004

ABSTRACT

Although text categorization is a burgeoning area of IR research, readily available test collections in this field are surprisingly scarce. We describe a methodology and system (named ACCIO) for automatically acquiring labeled datasets for text categorization from the World Wide Web, by capitalizing on the body of knowledge encoded in the structure of existing hierarchical directories such as the Open Directory. We define parameters of categories that make it possible to acquire numerous datasets with desired properties, which in turn allow better control over categorization experiments. In particular, we develop metrics that estimate the difficulty of a dataset by examining the host directory structure. These metrics are shown to be good predictors of the categorization accuracy achievable on a dataset, and serve as efficient heuristics for generating datasets subject to the user's requirements. A large collection of automatically generated datasets is made available for other researchers to use.
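
As an illustration only, one simple way to examine the directory structure is to count the edges separating two categories in the directory tree. The sketch below assumes that categories are written as slash-separated paths and that categories lying closer together are harder to tell apart. The function name `path_distance` and the example paths are hypothetical, and this is not necessarily the metric used by ACCIO.

```python
def path_distance(cat_a: str, cat_b: str) -> int:
    """Count the edges between two categories in a tree-shaped directory.

    Categories are given as slash-separated paths from the root,
    e.g. "Top/Science/Physics" (a hypothetical example path).
    """
    a = cat_a.strip("/").split("/")
    b = cat_b.strip("/").split("/")

    # Length of the shared prefix, i.e. the depth of the deepest common ancestor.
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1

    # Edges from each category up to the common ancestor.
    return (len(a) - common) + (len(b) - common)


if __name__ == "__main__":
    # Sibling categories are close; under the stated assumption, a harder pair.
    print(path_distance("Top/Science/Physics", "Top/Science/Chemistry"))
    # Categories in different top-level branches are farther apart.
    print(path_distance("Top/Science/Physics/Optics", "Top/Arts/Music"))
```

Under this assumption, a dataset built from a pair of sibling categories would be expected to be harder than one built from categories in distant branches.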


Published in

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
July 2004
624 pages
ISBN: 1581138814
DOI: 10.1145/1008992

Copyright © 2004 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 792 of 3,983 submissions, 20%
