research-article

Integrating Document Clustering and Multidocument Summarization

Authors:
Dingding Wang (Florida International University),
Shenghuo Zhu (NEC Laboratories America),
Tao Li (Florida International University),
Yun Chi (NEC Laboratories America),
Yihong Gong (NEC Laboratories America)

ACM Transactions on Knowledge Discovery from Data, Volume 5, Issue 3, Article No.: 14, pp 1–26. https://doi.org/10.1145/1993077.1993078

Published: 01 August 2011

Abstract

Document understanding techniques such as document clustering and multidocument summarization have been receiving much attention recently. Current document clustering methods usually represent the given collection of documents as a document-term matrix and then conduct the clustering process. Although many of these clustering methods can group the documents effectively, it is still hard for people to capture the meaning of the documents since there is no satisfactory interpretation for each document cluster. A straightforward solution is to first cluster the documents and then summarize each document cluster using summarization methods. However, most of the current summarization methods are solely based on the sentence-term matrix and ignore the context dependence of the sentences. As a result, the generated summaries lack guidance from the document clusters. In this article, we propose a new language model to simultaneously cluster and summarize documents by making use of both the document-term and sentence-term matrices. By utilizing the mutual influence of document clustering and summarization, our method yields (1) a better document clustering method with more meaningful interpretation and (2) an effective document summarization method with guidance from document clustering. Experimental results on various document datasets show the effectiveness of our proposed method and the high interpretability of the generated summaries.
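
As a rough illustration of the two inputs the abstract mentions, the following Python sketch builds a document-term matrix and a sentence-term matrix from a toy corpus. The tokenizer, the tiny stopword list, the regex-based sentence splitter, and the names `tokenize`, `split_sentences`, and `build_matrices` are all assumptions made for this example; the article's actual preprocessing is not given on this page.

```python
import re
import numpy as np

# Tiny illustrative stopword list (an assumption, not the article's list).
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def split_sentences(doc):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]

def build_matrices(docs):
    """Return (document-term counts, sentence-term counts, vocab, sentences)."""
    sentences = [s for d in docs for s in split_sentences(d)]
    vocab = sorted({w for d in docs for w in tokenize(d)})
    index = {w: j for j, w in enumerate(vocab)}

    def count_rows(texts):
        # One row per text, one column per vocabulary term.
        M = np.zeros((len(texts), len(vocab)))
        for i, t in enumerate(texts):
            for w in tokenize(t):
                M[i, index[w]] += 1
        return M

    return count_rows(docs), count_rows(sentences), vocab, sentences

toy_docs = ["Cats chase mice. Mice eat cheese.", "Stocks fell. Markets fell again."]
doc_term, sent_term, vocab, sentences = build_matrices(toy_docs)
print(doc_term.shape, sent_term.shape)
```

Both matrices share the same vocabulary columns, so the rows of the sentence-term matrix that belong to one document add up to that document's row in the document-term matrix.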

Index Terms

Integrating Document Clustering and Multidocument Summarization
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information retrieval

Recommendations

Integrating clustering and multi-document summarization to improve document understanding
CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management

Document understanding techniques such as document clustering and multi-document summarization have been receiving much attention in recent years. Current document clustering methods usually represent documents as a term-document matrix and perform ...

Multidocument summarization: An added value to clustering in interactive retrieval

A more and more generalized problem in effective information access is the presence in the same corpus of multiple documents that contain similar information. Generally, users may be interested in locating, for a topic addressed by a group of similar ...

Experiments in multidocument summarization
HLT '02: Proceedings of the second international conference on Human Language Technology Research

This paper describes a multidocument summarizer built upon research into the detection of new information. The summarizer uses several new strategies to select interesting and informative sentences, including an innovative measure of importance derived ...

Reviews

Reviewer: Amos O Olagunju

Future information retrieval systems require effective document grouping and condensation algorithms to aid users in the interpretation of retrieved documents. The design of such algorithms for generating meaningful and interpretable document summaries poses challenges, despite the available methods for information retrieval systems in the literature [1]. How should retrieved documents be effectively clustered and abridged for meaningful understanding? Wang et al. present a sentence-based factoring language framework for simultaneously grouping and abridging documents. This framework minimizes the divergence between documents and terms to generate document-term and sentence-term matrices from a collection of documents and to produce meaningful document summaries for interpretation.
Specifically, the sentence-based factoring language framework assigns documents to topics according to the degree of relevance, and then extracts a summary from those sentences whose topics have the highest probabilities of relevance in the document collection. This framework constructs clusters of documents, derives the probabilities for the documents and sentences, and generates the scores for deciding on the summary of each document cluster. It eliminates formatting characters and common words (nonindex terms) from a document collection prior to generating the document-term and sentence-term matrices. It then uses the sentence-term matrix to perform nonnegative matrix factorization on the document-term matrix to produce document-topic and sentence-topic matrices and consequently to generate the document clusters and summaries.
The authors undoubtedly present a reliable computational algorithm for performing nonnegative matrix factorization based on the Dirichlet distribution [2] and parameter estimation by a method of maximum likelihood. They use a synthetic dataset to clearly illustrate how the presented framework functions.
They conducted experiments with prototypical document collections to evaluate the effectiveness of the approach in clustering documents and generating summaries. Their framework significantly outperformed well-known algorithms such as k-means, information-theoretic clustering, Euclidean co-clustering and minimum sum-squared co-clustering, and nonnegative matrix factorization in generating accurate document clusters. In addition, it outperformed most recent information retrieval systems that use semantic manifold ranking and analysis to summarize documents. The experimental results convincingly show that the sentence-based factoring language framework is useful for mining the contextual facts rooted in a document collection to produce a meaningful summary of documents. The sentence-based factoring language framework shows promise for extracting the meaning of documents in future information retrieval systems.
Online Computing Reviews Service
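
The factorization the review describes, in which the sentence-term matrix serves as a fixed basis while document-topic and sentence-topic matrices are learned, can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the authors' algorithm: it assumes the model A ≈ U Vᵀ B (A document-term, B sentence-term, both row-normalized), fits it with standard multiplicative updates for the generalized KL divergence, and leaves out the Dirichlet priors and maximum-likelihood estimation the review mentions. The names `generalized_kl` and `factorize_with_given_bases` and the toy matrices are hypothetical.

```python
import numpy as np

def generalized_kl(A, A_hat, eps=1e-12):
    """Generalized KL divergence D(A || A_hat), summed over all entries."""
    return float(np.sum(A * np.log((A + eps) / (A_hat + eps)) - A + A_hat))

def factorize_with_given_bases(A, B, k, n_iter=200, seed=0, eps=1e-12):
    """Fit A ~ U @ V.T @ B with U, V >= 0 while B stays fixed (assumed model)."""
    rng = np.random.default_rng(seed)
    U = rng.random((A.shape[0], k)) + 0.1   # document-topic weights
    V = rng.random((B.shape[0], k)) + 0.1   # sentence-topic weights
    history = []
    for _ in range(n_iter):
        W = V.T @ B                          # topic-term matrix built from sentences
        R = A / (U @ W + eps)                # elementwise ratio A / A_hat
        U *= (R @ W.T) / (W.sum(axis=1) + eps)
        R = A / (U @ W + eps)                # refresh the ratio with the new U
        V *= (B @ R.T @ U) / (np.outer(B.sum(axis=1), U.sum(axis=0)) + eps)
        history.append(generalized_kl(A, U @ V.T @ B))
    return U, V, history

# Toy data (hypothetical): terms 0-1 form one theme, terms 2-3 another.
A = np.array([[3, 3, 0, 0], [4, 2, 0, 0], [0, 0, 3, 3], [0, 0, 2, 4]], dtype=float)
B = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]], dtype=float)
A /= A.sum(axis=1, keepdims=True)
B /= B.sum(axis=1, keepdims=True)

U, V, history = factorize_with_given_bases(A, B, k=2)
clusters = U.argmax(axis=1)        # assign each document to its strongest topic
top_sentence = V.argmax(axis=0)    # highest-weighted sentence index per topic
print(clusters, top_sentence, history[-1])
```

Reading clusters off `U` and summary sentences off `V` mirrors the review's description of assigning documents to topics and picking the most relevant sentences per topic. The column scales of `U` and `V` are not uniquely determined in this sketch, so a fuller version would likely normalize them before comparing weights across topics.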

Published in

ACM Transactions on Knowledge Discovery from Data Volume 5, Issue 3
August 2011
119 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/1993077

Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 August 2011
- Accepted: 1 July 2010
- Revised: 1 February 2010
- Received: 1 June 2009
Author Tags
Document clustering
multidocument summarization
nonnegative matrix factorization with given bases
Qualifiers
- research-article
- Research
- Refereed
Article Metrics
- Total Citations: 57
- Total Downloads: 1,334
- Downloads (Last 12 months): 25
- Downloads (Last 6 weeks): 6