ABSTRACT
This paper presents a novel research problem, Comparative Document Analysis (CDA): the joint discovery of commonalities and differences between two individual documents (or two sets of documents) in a large text corpus. Given any pair of documents from a background document collection, CDA aims to automatically identify sets of quality phrases that concisely and informatively summarize what the two documents have in common and highlight what distinguishes each from the other. Our solution uses a general graph-based framework to derive novel measures of phrase semantic commonality and pairwise distinction, with the background corpus used to compute phrase-document semantic relevance. These measures guide the selection of phrase sets through two joint optimization problems. We develop a scalable iterative algorithm that integrates the maximization of the phrase commonality or distinction measure with the learning of phrase-document semantic relevance. Experiments on large text corpora from two different domains (scientific papers and news) demonstrate the effectiveness and robustness of the proposed framework in comparing documents. Analysis on a 10GB+ text corpus demonstrates the scalability of our method, whose computation time grows linearly with corpus size. A case study comparing news articles published on different dates shows that the method also applies to comparing sets of documents.
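The notions of phrase commonality and pairwise distinction described above can be illustrated with a minimal sketch. It assumes phrase-document relevance scores are already available as plain dictionaries, and it swaps in two simple stand-in measures (a harmonic mean for commonality, a score difference for distinction) in place of the graph-based measures and joint optimization that the paper develops. All function names and the toy scores are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Relevance = Dict[str, float]  # phrase -> relevance to one document (assumed given)


def commonality(r_a: float, r_b: float) -> float:
    """Stand-in measure: harmonic mean, high only when both scores are high."""
    if r_a + r_b == 0:
        return 0.0
    return 2 * r_a * r_b / (r_a + r_b)


def distinction(r_own: float, r_other: float) -> float:
    """Stand-in measure: how much more relevant the phrase is to one document."""
    return r_own - r_other


def compare(rel_a: Relevance, rel_b: Relevance, k: int = 2
            ) -> Tuple[List[str], List[str], List[str]]:
    """Return top-k common phrases and top-k distinct phrases for each document."""
    phrases = sorted(set(rel_a) | set(rel_b))

    def top(score: Callable[[str], float]) -> List[str]:
        # Sort by descending score, breaking ties alphabetically.
        return sorted(phrases, key=lambda p: (-score(p), p))[:k]

    common = top(lambda p: commonality(rel_a.get(p, 0.0), rel_b.get(p, 0.0)))
    only_a = top(lambda p: distinction(rel_a.get(p, 0.0), rel_b.get(p, 0.0)))
    only_b = top(lambda p: distinction(rel_b.get(p, 0.0), rel_a.get(p, 0.0)))
    return common, only_a, only_b


# Toy relevance scores (invented for illustration only).
rel_a = {"topic model": 0.9, "text summarization": 0.8,
         "gibbs sampling": 0.7, "sentence extraction": 0.1}
rel_b = {"topic model": 0.8, "text summarization": 0.9,
         "sentence extraction": 0.7, "gibbs sampling": 0.1}

common, only_a, only_b = compare(rel_a, rel_b, k=2)
print("common:", common)
print("distinct to A:", only_a)
print("distinct to B:", only_b)
```

Scoring each phrase independently, as this sketch does, can place the same phrase in both a common set and a distinct set; avoiding that kind of overlap is one reason to select the phrase sets jointly rather than one ranking at a time.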
Index Terms
- Comparative Document Analysis for Large Text Corpora