DOI: 10.1145/3018661.3018690 (WSDM conference proceedings)

Comparative Document Analysis for Large Text Corpora

Published: 02 February 2017

ABSTRACT

This paper presents a novel research problem, Comparative Document Analysis (CDA): the joint discovery of commonalities and differences between two individual documents (or two sets of documents) in a large text corpus. Given any pair of documents from a (background) document collection, CDA aims to automatically identify sets of quality phrases that summarize the commonalities of both documents and highlight the distinctions of each with respect to the other, informatively and concisely. Our solution uses a general graph-based framework to derive novel measures of phrase semantic commonality and pairwise distinction, where the background corpus is used for computing phrase-document semantic relevance. We use the measures to guide the selection of sets of phrases by solving two joint optimization problems. A scalable iterative algorithm is developed to integrate the maximization of the phrase commonality or distinction measure with the learning of phrase-document semantic relevance. Experiments on large text corpora from two different domains (scientific papers and news) demonstrate the effectiveness and robustness of the proposed framework in comparing documents. Analysis on a 10GB+ text corpus demonstrates the scalability of our method, whose computation time grows linearly with corpus size. Our case study comparing news articles published on different dates shows the power of the proposed method in comparing sets of documents.
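
To make the problem setting concrete, the following is a minimal toy sketch of the input and output of CDA. It is not the graph-based method of the paper. It assumes a TF-IDF-style score as a stand-in for the learned phrase-document relevance, uses the minimum of the two relevance scores as a commonality score, and uses their difference as a distinction score. The function names (`relevance`, `compare`), the parameter `k`, and the example phrases are all hypothetical.

```python
import math
from collections import Counter


def relevance(phrase_counts, doc_freq, n_docs):
    """TF-IDF-style stand-in for phrase-document relevance.

    Assumption: the paper learns relevance on a graph; this simple
    weighting only illustrates the role of the background corpus.
    """
    return {
        phrase: tf * math.log((1 + n_docs) / (1 + doc_freq.get(phrase, 0)))
        for phrase, tf in phrase_counts.items()
    }


def top_phrases(scores, k):
    """Return up to k phrases with positive score, highest first."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [phrase for phrase, score in ranked if score > 0][:k]


def compare(doc_a, doc_b, background, k=2):
    """Return (common, distinct_a, distinct_b) phrase lists.

    Each document is a list of already-extracted phrases.
    """
    n_docs = len(background)
    # Number of background documents that contain each phrase.
    doc_freq = Counter(p for doc in background for p in set(doc))
    rel_a = relevance(Counter(doc_a), doc_freq, n_docs)
    rel_b = relevance(Counter(doc_b), doc_freq, n_docs)
    phrases = set(rel_a) | set(rel_b)

    # Toy commonality: a phrase must be relevant to both documents.
    common = {p: min(rel_a.get(p, 0.0), rel_b.get(p, 0.0)) for p in phrases}
    # Toy distinction: relevant to one document but not to the other.
    dist_a = {p: rel_a.get(p, 0.0) - rel_b.get(p, 0.0) for p in phrases}
    dist_b = {p: rel_b.get(p, 0.0) - rel_a.get(p, 0.0) for p in phrases}
    return top_phrases(common, k), top_phrases(dist_a, k), top_phrases(dist_b, k)


if __name__ == "__main__":
    doc_a = ["topic model", "gibbs sampling", "text corpus"]
    doc_b = ["topic model", "word embedding", "text corpus"]
    background = [
        doc_a,
        doc_b,
        ["text corpus", "neural network"],
        ["text corpus", "graph ranking"],
    ]
    common, only_a, only_b = compare(doc_a, doc_b, background)
    print("common:", common)
    print("distinct to A:", only_a)
    print("distinct to B:", only_b)
```

In this toy setup, a phrase that occurs in every background document receives zero relevance, so it is selected neither as a commonality nor as a distinction. This is one simple way to illustrate why a background corpus is needed in addition to the two documents being compared.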
