research-article

Mining multi-faceted overviews of arbitrary topics in a text collection

Authors:
Xu Ling, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Qiaozhu Mei, University of Illinois at Urbana-Champaign, Urbana, IL, USA
ChengXiang Zhai, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Bruce Schatz, University of Illinois at Urbana-Champaign, Urbana, IL, USA

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2008, Pages 497–505
https://doi.org/10.1145/1401890.1401952

Published: 24 August 2008

ABSTRACT

A common task in many text mining applications is to generate a multi-faceted overview of a topic in a text collection. Such an overview not only serves directly as an informative summary of the topic, but also provides a detailed view for navigating to different facets of the topic. Existing work has cast this problem as a categorization problem and requires training examples for each facet. This has three limitations: (1) All facets are predefined, which may not fit the needs of a particular user. (2) Training examples for each facet are often unavailable. (3) Such an approach only works for a predefined type of topic. In this paper, we overcome these limitations and study a new, more realistic setup of the problem, in which a user can flexibly describe each facet with keywords for an arbitrary topic, and we attempt to mine a multi-faceted overview in an unsupervised way. We take a probabilistic approach to solving this problem. Experiments on different genres of text data show that our approach can effectively generate a multi-faceted overview for arbitrary topics; the generated overviews are comparable with those generated by supervised methods with training examples. They are also more informative than unstructured flat summaries. The method is quite general and can thus be applied to multiple text mining tasks in different application domains.
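
The abstract does not specify the probabilistic model. As one illustration of the setup it describes (facets given as keywords, no training examples), the following minimal sketch assumes a mixture of unigram language models fit by EM, with each facet's keywords added as pseudo-counts. The function name `mine_facets`, its parameters, and the default values are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def mine_facets(sentences, facet_keywords, prior_weight=5.0, iters=10, smooth=0.01):
    """Softly assign sentences to keyword-described facets.

    Assumed model (not the paper's): a mixture of unigram distributions,
    one per facet, fit by EM with a uniform prior over facets.
    Returns one dict per sentence mapping facet name -> posterior probability.
    """
    docs = [Counter(s.lower().split()) for s in sentences]
    vocab = sorted(
        {w for d in docs for w in d}
        | {w for kws in facet_keywords.values() for w in kws}
    )
    facets = list(facet_keywords)

    # Keyword pseudo-counts anchor each facet to its user-supplied description.
    prior = {f: Counter({w: prior_weight for w in facet_keywords[f]}) for f in facets}

    def normalize(counts):
        # Additive smoothing keeps every word probability strictly positive.
        total = sum(counts[w] + smooth for w in vocab)
        return {w: (counts[w] + smooth) / total for w in vocab}

    word_dist = {f: normalize(prior[f]) for f in facets}
    resp = []
    for _ in range(iters):
        # E-step: posterior over facets for each sentence.
        resp = []
        for d in docs:
            logp = {
                f: sum(c * math.log(word_dist[f][w]) for w, c in d.items())
                for f in facets
            }
            top = max(logp.values())
            z = sum(math.exp(v - top) for v in logp.values())
            resp.append({f: math.exp(logp[f] - top) / z for f in facets})
        # M-step: re-estimate each facet's word distribution from soft counts
        # plus the keyword pseudo-counts.
        for f in facets:
            counts = Counter(prior[f])
            for d, r in zip(docs, resp):
                for w, c in d.items():
                    counts[w] += r[f] * c
            word_dist[f] = normalize(counts)
    return resp
```

In this sketch the keywords only seed the facets: words that co-occur with a facet's keywords gain probability under that facet across iterations, so sentences that contain no keyword can still be attracted to it. Taking the highest-posterior sentences per facet would give a crude overview; how the paper actually selects and ranks content is not stated in the abstract.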

Index Terms

1. Information systems
  1. Information retrieval

Published in
KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2008
1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li, Microsoft adCenter Labs
Program Chairs: Bing Liu, University of Illinois at Chicago; Sunita Sarawagi, Indian Institute of Technology, Bombay
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher: Association for Computing Machinery, New York, NY, United States

Author Tags
multi-faceted overview
statistical topic models
text mining

Acceptance Rates
KDD '08 Paper Acceptance Rate: 118 of 593 submissions, 20%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%