research-article

Integrating Document Clustering and Multidocument Summarization

Published: 01 August 2011

Abstract

Document understanding techniques such as document clustering and multidocument summarization have been receiving much attention recently. Current document clustering methods usually represent the given collection of documents as a document-term matrix and then conduct the clustering process. Although many of these clustering methods can group the documents effectively, it is still hard for people to capture the meaning of the documents since there is no satisfactory interpretation for each document cluster. A straightforward solution is to first cluster the documents and then summarize each document cluster using summarization methods. However, most of the current summarization methods are solely based on the sentence-term matrix and ignore the context dependence of the sentences. As a result, the generated summaries lack guidance from the document clusters. In this article, we propose a new language model to simultaneously cluster and summarize documents by making use of both the document-term and sentence-term matrices. By utilizing the mutual influence of document clustering and summarization, our method provides (1) a better document clustering method with more meaningful interpretation and (2) an effective document summarization method with guidance from document clustering. Experimental results on various document datasets show the effectiveness of our proposed method and the high interpretability of the generated summaries.
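To make the two representations named in the abstract concrete, the following is a minimal sketch of how a document-term matrix and a sentence-term matrix over a shared vocabulary might be built. It is an illustration only, not the authors' preprocessing: the stop-word list, the regex tokenizer, the punctuation-based sentence splitter, and the function names (tokenize, split_sentences, build_matrices) are all assumptions introduced here.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "are", "on", "in", "to"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def split_sentences(doc):
    """Naive splitter on '.', '!' and '?' (an assumption, not a real segmenter)."""
    return [s.strip() for s in re.split(r"[.!?]+", doc) if s.strip()]


def build_matrices(docs):
    """Return (vocab, document-term matrix, sentence-term matrix, sentence owners).

    owners[i] is the index of the document that sentence i came from.
    """
    sentences, owners = [], []
    for d, doc in enumerate(docs):
        for s in split_sentences(doc):
            sentences.append(s)
            owners.append(d)

    # Both matrices share one vocabulary so their columns line up.
    vocab = sorted({t for doc in docs for t in tokenize(doc)})
    index = {t: j for j, t in enumerate(vocab)}

    def count_row(text):
        row = [0] * len(vocab)
        for term, n in Counter(tokenize(text)).items():
            row[index[term]] = n
        return row

    doc_term = [count_row(doc) for doc in docs]
    sent_term = [count_row(s) for s in sentences]
    return vocab, doc_term, sent_term, owners


if __name__ == "__main__":
    docs = ["Cats chase mice. Mice eat cheese.",
            "Stocks fall. Markets react to stocks."]
    vocab, doc_term, sent_term, owners = build_matrices(docs)
    print(vocab)
    print(doc_term)
    print(sent_term, owners)
```

Because both matrices use the same columns, the rows of the sentences belonging to one document sum to that document's row. That shared term space is what lets a single model relate sentences to document clusters.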

References

  1. Blei, D. M., Ng, A. Y., and Jordan, M. I. 2002. Latent Dirichlet allocation. In Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani Eds. MIT Press, Cambridge, MA, 601--608.
  2. Cho, H., Dhillon, I., Guan, Y., and Sra, S. 2004. Minimum sum squared residue co-clustering of gene expression data. In Proceedings of SIAM International Conference on Data Mining.
  3. Conroy, J. and O’Leary, D. 2001. Text summarization via hidden Markov models. In Proceedings of SIGIR. 406--407.
  4. Devore, J. and Peck, R. 1977. Statistics: The Exploration and Analysis of Data. Duxbury Press.
  5. Dhillon, I. 2001. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of ACM (SIGKDD) International Conference on Knowledge Discovery and Data Mining. 269--274.
  6. Dhillon, I., Mallela, S., and Modha, S. 2001. Information-theoretic co-clustering. In Proceedings of ACM (SIGKDD) International Conference on Knowledge Discovery and Data Mining. 89--98.
  7. Ding, C., He, X., Zha, H., Gu, M., and Simon, H. 2001. A min-max cut algorithm for graph partitioning and data clustering. In Proceedings of the IEEE International Conference on Data Mining (ICDM). 107--114.
  8. Ding, C., Li, T., Peng, W., and Park, H. 2006. Orthogonal nonnegative matrix tri-factorizations for clustering. In Proceedings of ACM (SIGKDD) International Conference on Knowledge Discovery and Data Mining. 126--135.
  9. Duda, R., Hart, P., and Stork, D. 2001. Pattern Classification. John Wiley and Sons, Inc.
  10. Dunlavy, D., O’Leary, D., Conroy, J., and Schlesinger, J. 2007. QCS: A system for querying, clustering and summarizing documents. Inform. Process. Manag. Int. J.
  11. Elkan, C. 2006. Clustering documents with an exponential-family approximation of the Dirichlet compound multinomial distribution. In Proceedings of International Conference on Machine Learning (ICML). 289--296.
  12. Erkan, G. and Radev, D. 2004. Lexpagerank: Prestige in multi-document text summarization. In Proceedings of International Conference on Empirical Method on Natural Language Processing (EMNLP).
  13. Goldstein, J., Kantrowitz, M., Mittal, V., and Carbonell, J. 1999. Summarizing text documents: Sentence selection and evaluation metrics. In Proceedings of the International ACM SIGIR Conference on Research and Development on Information Retrieval. 121--128.
  14. Gong, Y. and Liu, X. 2001. Generic text summarization using relevance measure and latent semantic analysis. In Proceedings of the International ACM SIGIR Conference on Research and Development on Information Retrieval. 75--95.
  15. He, J., Lan, M., Tan, C., Sung, S., and Low, H. 2004. Initialization of cluster refinement algorithms: A review and comparative study. In Proceedings of International Joint Conference on Neural Networks (IJCNN).
  16. Hoffman, T. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International SIGIR Conference on Research and Development on Information Retrieval.
  17. Jing, H. and McKeown, K. 2000. Cut and paste based text summarization. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
  18. Jing, L., Ng, M. K., and Huang, J. Z. 2007. An entropy weighting k-means algorithm for subspace clustering of high dimensional sparse data. IEEE Trans. Knowl. Data Eng. 1026--1041.
  19. Knight, K. and Marcu, D. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artif. Intell. 91--107.
  20. Lee, D. D. and Seung, H. S. 2001. Algorithms for non-negative matrix factorization. In Proceedings of the Conference on Neural Information Processing Systems (NIPS).
  21. Li, T. 2005. A general model for clustering binary data. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 188--197.
  22. Li, T. and Ding, C. 2006. The relationships among various nonnegative matrix factorization methods for clustering. In Proceedings of the IEEE International Conference on Data Mining. 362--371.
  23. Li, T., Ma, S., and Ogihara, M. 2004. Document clustering via adaptive subspace iteration. In Proceedings of the International ACM SIGIR Conference on Research and Development on Information Retrieval. 218--225.
  24. Lin, C.-Y. and Hovy, E. 2001. From single to multi-document summarization: A prototype system and its evaluation. In Proceedings of Association for Computational Linguistics (ACL). 457--464.
  25. Lin, C.-Y. and Hovy, E. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (HLT-NAACL). 71--78.
  26. Liu, X., Gong, Y., Xu, W., and Zhu, S. 2003. Document clustering with cluster refinement and model selection capabilities. In Proceedings of the International ACM SIGIR Conference on Research and Development on Information Retrieval. 191--198.
  27. Long, B., Wu, X., Zhang, Z. M., and Yu, P. S. 2006. Unsupervised learning on k-partite graphs. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 317--326.
  28. Long, C., Huang, M., Zhu, X., and Li, M. 2009. Multi-document summarization by information distance. In Proceedings of International Conference on Data Mining (ICDM).
  29. Mana-Lopez, M. J., Buenaga, M. D., and Gomez-Hidalgo, J. M. 2004. Multidocument summarization: An added value to clustering in interactive retrieval. ACM Trans. Inform. Syst.
  30. Mani, I. 2001. Automatic Summarization. John Benjamins Publishing Company.
  31. McKeown, K. R., Barzilay, R., Evans, D., Hatzivassiloglou, V., Klavans, J. L., Nenkova, A., Sable, C., Schiffman, B., and Sigelman, S. 2002. Tracking and summarizing news on a daily basis with Columbia’s newsblaster. In Proceedings of the 2nd International Conference on Human Language Technology Research.
  32. Mihalcea, R. and Tarau, P. 2005. A language independent algorithm for single and multiple document summarization. In Proceedings of International Conference on Natural Language Processing (IJCNLP).
  33. Nastase, V. 2008. Topic-driven multi-document summarization with encyclopedic knowledge and spreading activation. In Proceedings of Conference on Empirical Methods on Natural Language Processing (EMNLP). 763--772.
  34. Park, S., Lee, J.-H., Kim, D.-H., and Ahn, C.-M. 2007. Multi-document summarization based on cluster using non-negative matrix factorization. In Proceedings of Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM).
  35. Radev, D., Jing, H., Stys, M., and Tam, D. 2004. Centroid-based summarization of multiple documents. Inform. Process. Manag., 919--938.
  36. Ricardo, B. and Berthier, R. 1999. Modern Information Retrieval. ACM Press.
  37. Shen, D., Sun, J.-T., Li, H., Yang, Q., and Chen, Z. 2007. Document summarization using conditional random fields. In Proceedings of International Joint Conference on Artificial Intelligence (IJCAI). 2862--2867.
  38. Shi, J. and Malik, J. 2000. Normalized cuts and image segmentation. IEEE Trans. Patt. Anal. Mach. Intell. 888--905.
  39. Strehl, A. and Ghosh, J. 2003. Cluster ensembles---A knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 583--617.
  40. Tang, J., Yao, L., and Chen, D. 2009. Multi-topic based query-oriented summarization. In Proceedings of SIAM International Conference on Data Mining (SDM).
  41. Turpin, A., Tsegay, Y., Hawking, D., and Williams, H. 2007. Fast generation of result snippets in Web search. In Proceedings of ACM SIGIR Conference on Research and Development on Information Retrieval. 127--134.
  42. Wan, X. and Xiao, J. 2009. Graph-based multi-modality learning for topic-focused multi-document summarization. In Proceedings of International Joint Conference on Artificial Intelligence (IJCAI). 1586--1591.
  43. Wan, X. and Yang, J. 2008. Multi-document summarization using cluster-based link analysis. In Proceedings of the 31st ACM Annual International SIGIR Conference on Research and Development on Information Retrieval.
  44. Wan, X., Yang, J., and Xiao, J. 2007. Manifold-ranking based topic-focused multi-document summarization. In Proceedings of International Joint Conference on Artificial Intelligence (IJCAI). 2903--2908.
  45. Wang, D., Li, T., Zhu, S., and Ding, C. 2008a. Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. In Proceedings of ACM SIGIR Conference on Research and Development on Information Retrieval.
  46. Wang, D., Zhu, S., Li, T., Chi, Y., and Gong, Y. 2008b. Integrating clustering and multi-document summarization to improve document understanding. In Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM). 1435--1436.
  47. Wang, D., Zhu, S., Li, T., and Gong, Y. 2009. Multi-document summarization using sentence-based topic models. In Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL’09).
  48. Wang, F., Zhang, C., and Li, T. 2007. Regularized clustering for documents. In Proceedings of ACM SIGIR Conference on Research and Development on Information Retrieval. 95--102.
  49. Wei, F., Li, W., Lu, Q., and He, Y. 2008. Query-sensitive mutual reinforcement chain and its application in query-oriented multi-document summarization. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development on Information Retrieval. ACM, 283--290.
  50. Xu, W. and Gong, Y. 2004. Document clustering by concept factorization. In Proceedings of ACM SIGIR Conference on Research and Development on Information Retrieval. 202--209.
  51. Xu, W., Liu, X., and Gong, Y. 2003. Document clustering based on non-negative matrix factorization. In Proceedings of ACM SIGIR Conference on Research and Development on Information Retrieval. 373--386.
  52. Yih, W.-T., Goodman, J., Vanderwende, L., and Suzuki, H. 2007. Multi-document summarization by maximizing informative content-words. In Proceedings of International Joint Conference on Artificial Intelligence (IJCAI). 1776--1782.
  53. Zamir, O. and Etzioni, O. 1998. Web document clustering: A feasibility demonstration. In Proceedings of ACM SIGIR Conference on Research and Development on Information Retrieval. 46--54.
  54. Zha, H., He, X., Ding, C., Gu, M., and Simon, H. 2001. Bipartite graph partitioning and data clustering. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM). 25--32.
  55. Zhong, S. and Ghosh, J. 2003. A unified framework for model-based clustering. J. Mach. Learn. Res., 1001--1037.
  56. Zhong, S. and Ghosh, J. 2005. Generative model-based document clustering: A comparative study. Knowl. Inf. Syst., 374--384.
  57. Zien, J., Schlag, M., and Chan, P. K. 1999. Multilevel spectral hypergraph partitioning with arbitrary vertex sizes. IEEE Trans. Comput.-Aid. Design. 1389--1399.

      Reviews

      Amos O Olagunju

      Future information retrieval systems require effective document grouping and condensation algorithms to aid users in the interpretation of retrieved documents. The design of such algorithms for generating meaningful and interpretable document summaries poses challenges, despite the available methods for information retrieval systems in the literature [1]. How should retrieved documents be effectively clustered and abridged for meaningful understanding? Wang et al. present a sentence-based factoring language framework for simultaneously grouping and abridging documents. This framework minimizes the divergence between documents and terms to generate document-term and sentence-term matrices from a collection of documents and to produce meaningful document summaries for interpretation. Specifically, the sentence-based factoring language framework assigns documents to topics according to the degree of relevance, and then extracts a summary from those sentences whose topics have the highest probabilities of relevance in the document collection. This framework constructs clusters of documents, derives the probabilities for the documents and sentences, and generates the scores for deciding on the summary of each document cluster. It eliminates formatting characters and common words (nonindex terms) from a document collection prior to generating the document-term and sentence-term matrices. It then uses the sentence-term matrix to perform nonnegative matrix factorization on the document-term matrix to produce document-topic and sentence-topic matrices and consequently to generate the document clusters and summaries. The authors undoubtedly present a reliable computational algorithm for performing nonnegative matrix factorization based on the Dirichlet distribution [2] and parameter estimation by a method of maximum likelihood. They use a synthetic dataset to clearly illustrate how the presented framework functions.
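The pipeline described above (factorize the document-term matrix, assign each document to its strongest topic, then pick sentences that score highest for each topic) can be sketched as follows. This is a deliberately simplified stand-in, not the authors' model: it uses generic nonnegative matrix factorization with the standard multiplicative updates for squared error instead of the Dirichlet-based, maximum-likelihood formulation the review mentions, and the sentence scoring rule (a dot product with each topic's term weights) is an assumption introduced here. All function names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*a)]


def frobenius_error(x, w, h):
    """Squared Frobenius norm of X - W H."""
    approx = matmul(w, h)
    return sum((xv - av) ** 2
               for xr, ar in zip(x, approx) for xv, av in zip(xr, ar))


def nmf(x, k, iters=200, seed=0, eps=1e-9):
    """Factor X (docs x terms) into W (docs x topics) and H (topics x terms).

    Uses multiplicative updates, which keep every entry nonnegative.
    """
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (X H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


def assign_clusters(w):
    """Hard-assign each document to its highest-weight topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in w]


def best_sentence_per_topic(sent_term, h):
    """Score sentences against each topic's term weights (assumed rule).

    Returns one sentence index per topic.
    """
    scores = matmul(sent_term, transpose(h))  # sentences x topics
    return [max(range(len(sent_term)), key=lambda s: scores[s][a])
            for a in range(len(h))]


if __name__ == "__main__":
    # Toy document-term matrix with two groups of documents using disjoint terms.
    doc_term = [[2, 2, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 3, 3]]
    w, h = nmf(doc_term, k=2)
    print(assign_clusters(w))
    print(frobenius_error(doc_term, w, h))
```

In this sketch the term-topic factor H learned from the documents is reused to score sentences, which is one simple way to let the clustering guide the summary. The framework under review couples the two matrices within a single model rather than in two separate steps.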
      They conducted experiments with prototypical document collections to evaluate the effectiveness of the approach in clustering documents and generating summaries. Their framework significantly outperformed well-known algorithms such as k-means, information-theoretic clustering, Euclidean co-clustering and minimum sum-squared co-clustering, and nonnegative matrix factorization in generating accurate document clusters. In addition, it outperformed most recent information retrieval systems that use semantic manifold ranking and analysis to summarize documents. The experimental results convincingly show that the sentence-based factoring language framework is useful for mining the contextual facts rooted in a document collection to produce a meaningful summary of documents. The sentence-based factoring language framework shows promise for extracting the meaning of documents in future information retrieval systems.

      Online Computing Reviews Service

      • Published in

        ACM Transactions on Knowledge Discovery from Data  Volume 5, Issue 3
        August 2011
        119 pages
        ISSN:1556-4681
        EISSN:1556-472X
        DOI:10.1145/1993077

        Copyright © 2011 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 1 August 2011
        • Accepted: 1 July 2010
        • Revised: 1 February 2010
        • Received: 1 June 2009

        Qualifiers

        • research-article
        • Research
        • Refereed
