research-article

Learning to classify short and sparse text & web with hidden topics from large-scale data collections

Authors:
Xuan-Hieu Phan, Tohoku University, Sendai, Japan
Le-Minh Nguyen, Japan Advanced Institute of Science and Technology, Nomi, Japan
Susumu Horiguchi, Tohoku University, Sendai, Japan

WWW '08: Proceedings of the 17th international conference on World Wide Web, April 2008, Pages 91–100, https://doi.org/10.1145/1367497.1367510

Published: 21 April 2008

ABSTRACT

This paper presents a general framework for building classifiers for short and sparse text and Web segments by exploiting hidden topics discovered from large-scale data collections. The motivation is that many classification tasks over short text and Web segments, such as search snippets, forum and chat messages, blog and news feeds, product reviews, and book and movie summaries, fail to achieve high accuracy because of data sparseness. We therefore draw on external knowledge to make the data more closely related and to expand the coverage of classifiers so that they handle future data better. The underlying idea is that, for each classification task, we collect a large-scale external data collection called the "universal dataset" and then build a classifier on both a (small) set of labeled training data and a rich set of hidden topics discovered from that collection. The framework is general enough to be applied to different data domains and genres, ranging from Web search results to medical text. We carefully evaluated the framework on several hundred megabytes of Wikipedia (30M words) and MEDLINE (18M words) data with two tasks, "Web search domain disambiguation" and "disease categorization for medical text", and achieved significant quality enhancement.
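
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the topic model, so the sketch assumes latent Dirichlet allocation fitted by collapsed Gibbs sampling, and the function names (`train_lda`, `infer_topics`, `augment_features`), the hyperparameters, and the toy "universal dataset" are all hypothetical. It discovers topics on the external collection, estimates topic proportions for a short snippet, and appends them to the snippet's word features so that any standard classifier can be trained on the enriched representation.

```python
import random
from collections import Counter


def train_lda(docs, num_topics, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized documents (assumed model)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    topic_word = [Counter() for _ in range(num_topics)]  # n(topic, word)
    topic_total = [0] * num_topics                       # n(topic)
    doc_topic = [[0] * num_topics for _ in docs]         # n(doc, topic)
    assign = []
    for d, doc in enumerate(docs):                       # random initialization
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, k in zip(doc, zs):
            topic_word[k][w] += 1
            topic_total[k] += 1
            doc_topic[d][k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]                         # remove current assignment
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                doc_topic[d][k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k                         # record the resampled topic
                topic_word[k][w] += 1
                topic_total[k] += 1
                doc_topic[d][k] += 1
    return {"topic_word": topic_word, "topic_total": topic_total,
            "vocab_size": vocab_size, "alpha": alpha, "beta": beta}


def infer_topics(snippet, model, iters=50, seed=0):
    """Estimate topic proportions for a short text, keeping trained topics fixed."""
    rng = random.Random(seed)
    num_topics = len(model["topic_total"])
    alpha, beta, v = model["alpha"], model["beta"], model["vocab_size"]
    counts = [0] * num_topics
    zs = [rng.randrange(num_topics) for _ in snippet]
    for k in zs:
        counts[k] += 1
    for _ in range(iters):
        for i, w in enumerate(snippet):
            counts[zs[i]] -= 1
            weights = [
                (counts[t] + alpha)
                * (model["topic_word"][t][w] + beta)
                / (model["topic_total"][t] + v * beta)
                for t in range(num_topics)
            ]
            zs[i] = rng.choices(range(num_topics), weights=weights)[0]
            counts[zs[i]] += 1
    denom = len(snippet) + num_topics * alpha
    return [(c + alpha) / denom for c in counts]         # smoothed proportions


def augment_features(snippet, theta):
    """Combine bag-of-words counts with hidden-topic proportions."""
    feats = {f"word:{w}": float(c) for w, c in Counter(snippet).items()}
    feats.update({f"topic:{k}": p for k, p in enumerate(theta)})
    return feats


# Toy stand-in for the universal dataset (hypothetical data).
universal = [
    "java code compiler program software".split(),
    "program software code bug compiler".split(),
    "disease patient treatment therapy clinic".split(),
    "patient clinic disease symptom therapy".split(),
]
model = train_lda(universal, num_topics=2)
snippet = "java code".split()
print(augment_features(snippet, infer_topics(snippet, model)))
```

The enriched feature dictionary would then be passed, together with the labels of the small training set, to whatever classifier the task calls for; the sketch stops before that step because the abstract does not specify one.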

Index Terms

Learning to classify short and sparse text & web with hidden topics from large-scale data collections
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
  2. Machine learning
2. Information systems
  1. Information retrieval
    1. Document representation

Published in
WWW '08: Proceedings of the 17th international conference on World Wide Web
April 2008
1326 pages
ISBN: 9781605580852
DOI: 10.1145/1367497
General Chairs:
Jinpeng Huai, Beihang University, China
Robin Chen, AT&T Labs, USA
Hsiao-Wuen Hon, Microsoft Research Asia, China
Yunhao Liu, HK University of Science and Technology, Hong Kong
Program Chairs:
Wei-Ying Ma, Microsoft Research Asia, China
Andrew Tomkins, Yahoo! Research, USA
Xiaodong Zhang, The Ohio State University, USA
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
sparse text
topic analysis
web data analysis/classification
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%