Abstract
Document understanding techniques such as document clustering and multidocument summarization have received much attention recently. Current document clustering methods usually represent the given collection of documents as a document-term matrix and then conduct the clustering process. Although many of these clustering methods can group the documents effectively, it is still hard for people to capture the meaning of the documents, since there is no satisfactory interpretation for each document cluster. A straightforward solution is to first cluster the documents and then summarize each document cluster using summarization methods. However, most current summarization methods are based solely on the sentence-term matrix and ignore the context dependence of the sentences. As a result, the generated summaries lack guidance from the document clusters. In this article, we propose a new language model that simultaneously clusters and summarizes documents by making use of both the document-term and sentence-term matrices. By utilizing the mutual influence of document clustering and summarization, our method yields (1) a better document clustering method with more meaningful interpretation and (2) an effective document summarization method with guidance from document clustering. Experimental results on various document datasets show the effectiveness of our proposed method and the high interpretability of the generated summaries.
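As a minimal sketch of the two inputs named above, the following toy Python builds a document-term matrix and a sentence-term matrix over a shared vocabulary. The tokenizer, the sentence splitter, the raw-count weighting, and the example texts are all illustrative assumptions; this is not the representation or the model used in the article.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(doc):
    # Assumed splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]


def term_matrix(units, vocab):
    # One row per text unit, one column per vocabulary term (raw counts).
    index = {term: j for j, term in enumerate(vocab)}
    rows = []
    for unit in units:
        row = [0] * len(vocab)
        for token, count in Counter(tokenize(unit)).items():
            row[index[token]] = count
        rows.append(row)
    return rows


# Hypothetical two-document collection.
docs = [
    "Clustering groups documents. Summaries explain clusters.",
    "Summaries select sentences.",
]
sentences = [s for d in docs for s in split_sentences(d)]
vocab = sorted({t for d in docs for t in tokenize(d)})

doc_term = term_matrix(docs, vocab)        # documents x terms
sent_term = term_matrix(sentences, vocab)  # sentences x terms
```

In this toy setup each document row is the sum of the rows of its own sentences, so the two matrices describe the same text at two granularities.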