Comparative Document Analysis for Large Text Corpora

Authors:
Xiang Ren, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Yuanhua Lv, Microsoft Research, Redmond, WA, USA
Kuansan Wang, Microsoft Research, Redmond, WA, USA
Jiawei Han, University of Illinois at Urbana-Champaign, Urbana, IL, USA

WSDM '17: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, February 2017, Pages 325–334
https://doi.org/10.1145/3018661.3018690

Published: 02 February 2017

ABSTRACT

This paper presents a novel research problem, Comparative Document Analysis (CDA): the joint discovery of commonalities and differences between two individual documents (or two sets of documents) in a large text corpus. Given any pair of documents from a (background) document collection, CDA aims to automatically identify sets of quality phrases that summarize the commonalities of both documents and highlight the distinctions of each with respect to the other, informatively and concisely. Our solution uses a general graph-based framework to derive novel measures of phrase semantic commonality and pairwise distinction, where the background corpus is used for computing phrase-document semantic relevance. We use the measures to guide the selection of sets of phrases by solving two joint optimization problems. A scalable iterative algorithm is developed to integrate the maximization of the phrase commonality or distinction measure with the learning of phrase-document semantic relevance. Experiments on large text corpora from two different domains (scientific papers and news) demonstrate the effectiveness and robustness of the proposed framework for comparing documents. Analysis on a 10GB+ text corpus demonstrates the scalability of our method, whose computation time grows linearly as the corpus size increases. Our case study comparing news articles published on different dates shows the power of the proposed method for comparing sets of documents.
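
To make the problem setup concrete, the following is a minimal Python sketch of the CDA input and output under strong simplifying assumptions. It does not implement the graph-based measures or the joint optimization described in the abstract: relevance is approximated by within-document phrase frequency, commonality by the harmonic mean of the two relevance scores, and distinction by their difference. The function names (`phrase_relevance`, `compare_documents`), the parameter `k`, and the example phrases are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def phrase_relevance(phrases):
    """Toy relevance score: relative frequency of each phrase in one document.

    Assumption: this stands in for the phrase-document semantic relevance
    that the abstract says is learned from the background corpus.
    """
    counts = Counter(phrases)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}


def compare_documents(phrases_a, phrases_b, k=2):
    """Return up to k common phrases and up to k distinct phrases per document."""
    rel_a = phrase_relevance(phrases_a)
    rel_b = phrase_relevance(phrases_b)
    vocab = set(rel_a) | set(rel_b)

    def commonality(p):
        # Harmonic mean: high only when the phrase is relevant to BOTH documents.
        a, b = rel_a.get(p, 0.0), rel_b.get(p, 0.0)
        return 2 * a * b / (a + b) if a + b > 0 else 0.0

    def distinction(p, own, other):
        # Positive only when the phrase is more relevant to `own` than to `other`.
        return own.get(p, 0.0) - other.get(p, 0.0)

    def top(score):
        # Rank by descending score, break ties alphabetically, drop non-positive scores.
        ranked = sorted(vocab, key=lambda p: (-score(p), p))
        return [p for p in ranked[:k] if score(p) > 0]

    return {
        "common": top(commonality),
        "distinct_a": top(lambda p: distinction(p, rel_a, rel_b)),
        "distinct_b": top(lambda p: distinction(p, rel_b, rel_a)),
    }


# Hypothetical documents, already segmented into phrases.
doc_a = ["topic model", "topic model", "gibbs sampling", "gibbs sampling"]
doc_b = ["topic model", "topic model", "news article", "news article"]

result = compare_documents(doc_a, doc_b, k=2)
print(result)
```

In this toy setting the shared phrase is reported as the commonality and each document's unique phrase as its distinction. Scoring each phrase independently, as done here, is a deliberate simplification; the abstract describes selecting phrase sets through joint optimization instead.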

Index Terms

Comparative Document Analysis for Large Text Corpora
1. Applied computing
  1. Document management and text processing
2. Information systems
  1. Information retrieval
  2. Information systems applications
    1. Data mining

Published in
WSDM '17: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining
February 2017
868 pages
ISBN: 9781450346757
DOI: 10.1145/3018661
General Chairs: Maarten de Rijke (University of Amsterdam), Milad Shokouhi (Microsoft)
Program Chairs: Andrew Tomkins (Google), Min Zhang (Tsinghua University)
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
automatic summarization
commonalities
comparative document analysis
distinction
large-scale text processing
massive text corpus
Qualifiers
- research-article
Acceptance Rates
WSDM '17 Paper Acceptance Rate: 80 of 505 submissions, 16%
Overall Acceptance Rate: 498 of 2,863 submissions, 17%