DOI: 10.1145/1367497.1367510
research-article

Learning to classify short and sparse text & web with hidden topics from large-scale data collections

Published: 21 April 2008

ABSTRACT

This paper presents a general framework for building classifiers that handle short and sparse text and Web segments by exploiting hidden topics discovered from large-scale data collections. The main motivation is that many classification tasks on short segments of text and Web content, such as search snippets, forum and chat messages, blog and news feeds, product reviews, and book and movie summaries, fail to achieve high accuracy because of data sparseness. We therefore propose gaining external knowledge to make the data more related and to expand the coverage of classifiers so that they handle future data better. The underlying idea of the framework is that, for each classification task, we collect a large-scale external data collection called the "universal dataset", and then build a classifier on both a (small) set of labeled training data and a rich set of hidden topics discovered from that data collection. The framework is general enough to be applied to different data domains and genres, ranging from Web search results to medical text. We carried out a careful evaluation on several hundred megabytes of Wikipedia (30M words) and MEDLINE (18M words) with two tasks, "Web search domain disambiguation" and "disease categorization for medical text", and achieved significant quality enhancement.
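The feature-enrichment step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which topic model or classifier is used, so the sketch assumes that topic-word probabilities have already been estimated from the universal dataset and simply folds a new snippet into them with a few EM iterations. The table `TOPIC_WORD`, its values, the topic names, and the functions `infer_topic_mix` and `augment` are all hypothetical.

```python
from collections import Counter

# Hypothetical topic-word probabilities, standing in for a topic model
# estimated beforehand on the universal dataset (all values are invented).
TOPIC_WORD = {
    "topic_sports": {"match": 0.4, "team": 0.4, "jaguar": 0.1, "speed": 0.1},
    "topic_cars": {"engine": 0.4, "dealer": 0.3, "jaguar": 0.2, "speed": 0.1},
}
SMOOTHING = 1e-6  # probability given to a word that a topic has never seen


def infer_topic_mix(words, topic_word, iterations=20):
    """Estimate topic proportions for one snippet by EM, keeping the
    topic-word probabilities fixed (an assumed, simplified inference step)."""
    topics = list(topic_word)
    # Only words covered by at least one topic carry information.
    known = [w for w in words if any(w in topic_word[t] for t in topics)]
    mix = {t: 1.0 / len(topics) for t in topics}
    if not known:
        return mix
    for _ in range(iterations):
        totals = {t: 0.0 for t in topics}
        for w in known:
            # E-step: share of this word attributed to each topic.
            scores = {t: mix[t] * topic_word[t].get(w, SMOOTHING) for t in topics}
            norm = sum(scores.values())
            for t in topics:
                totals[t] += scores[t] / norm
        # M-step: new proportions are the average attribution per word.
        mix = {t: totals[t] / len(known) for t in topics}
    return mix


def augment(snippet, topic_word, threshold=0.3):
    """Return word-count features plus one feature per sufficiently likely topic."""
    words = snippet.lower().split()
    features = {w: float(c) for w, c in Counter(words).items()}
    for topic, weight in infer_topic_mix(words, topic_word).items():
        if weight >= threshold:
            features[topic] = weight
    return features


a = augment("jaguar dealer", TOPIC_WORD)
b = augment("engine speed", TOPIC_WORD)
# The two snippets share no words, but both receive the same topic feature.
shared = set(a) & set(b)
```

In this toy setting the two snippets have no word in common, yet they overlap on a topic feature, which is the kind of sparseness reduction the framework aims for. The augmented feature dictionaries would then be passed, together with the labels of the small training set, to whichever classifier is chosen.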


Published in

WWW '08: Proceedings of the 17th international conference on World Wide Web
April 2008
1326 pages
ISBN: 9781605580852
DOI: 10.1145/1367497

Copyright © 2008 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
