ABSTRACT
This paper presents a general framework for building classifiers for short and sparse text and Web segments by exploiting hidden topics discovered from large-scale data collections. The work is motivated by the observation that many classification tasks involving short segments of text and Web content, such as search snippets, forum and chat messages, blog and news feeds, product reviews, and book and movie summaries, fail to achieve high accuracy because of data sparseness. We therefore propose drawing on external knowledge to make the data more closely related and to expand the coverage of classifiers so that they handle future data better. The underlying idea is that, for each classification task, we collect a large-scale external data collection called the "universal dataset" and then build a classifier on both a (small) set of labeled training data and a rich set of hidden topics discovered from that collection. The framework is general enough to be applied to different data domains and genres, ranging from Web search results to medical text. We carefully evaluated it on several hundred megabytes of Wikipedia (30M words) and MEDLINE (18M words) with two tasks, "Web search domain disambiguation" and "disease categorization for medical text", and achieved a significant improvement in quality.
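The enrichment step at the heart of the framework can be illustrated with a minimal sketch. It is not the paper's implementation: the topic-word weights in `TOPIC_WORD` are hypothetical values standing in for a topic model estimated from the universal dataset, `infer_topics` is a crude normalized-score stand-in for proper topic inference, and the `threshold` parameter and the `w=`/`t=` feature prefixes are assumptions made for illustration. The point is only that a short snippet gains topic features in addition to its own sparse words before it reaches the classifier.

```python
# Hypothetical topic-word weights. In the framework these would come from
# a topic model estimated on the large external "universal dataset".
TOPIC_WORD = {
    "health": {"virus": 0.4, "infection": 0.3, "vaccine": 0.3},
    "computing": {"virus": 0.1, "software": 0.5, "network": 0.4},
}


def infer_topics(tokens, topic_word):
    """Crude stand-in for topic inference: sum each topic's weight over
    the snippet's tokens, then normalize the scores to shares."""
    scores = {
        topic: sum(weights.get(tok, 0.0) for tok in tokens)
        for topic, weights in topic_word.items()
    }
    total = sum(scores.values())
    if total == 0.0:
        # No token is known to any topic, so no topic evidence is available.
        return {topic: 0.0 for topic in scores}
    return {topic: score / total for topic, score in scores.items()}


def enrich(snippet, topic_word, threshold=0.3):
    """Return word-count features plus a feature for every topic whose
    share reaches the (assumed) threshold."""
    tokens = snippet.lower().split()
    features = {}
    for tok in tokens:
        key = "w=" + tok
        features[key] = features.get(key, 0) + 1
    for topic, share in infer_topics(tokens, topic_word).items():
        if share >= threshold:
            features["t=" + topic] = share
    return features


features = enrich("vaccine virus", TOPIC_WORD)
print(sorted(features))
```

The enriched feature dictionaries for the labeled training snippets would then be passed to any standard classifier. Because the topic features come from the external collection, two snippets that share no words can still share a topic feature.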
Index Terms
- Learning to classify short and sparse text & web with hidden topics from large-scale data collections