QCS: A system for querying, clustering and summarizing documents

https://doi.org/10.1016/j.ipm.2007.01.003

Abstract

Information retrieval systems consist of many complicated components. Research and development of such systems is often hampered by the difficulty in evaluating how each particular component would behave across multiple systems. We present a novel integrated information retrieval system—the Query, Cluster, Summarize (QCS) system—which is portable, modular, and permits experimentation with different instantiations of each of the constituent text analysis components. Most importantly, the combination of the three types of methods in the QCS design improves retrievals by providing users more focused information organized by topic.

We demonstrate the improved performance through a series of experiments using standard test sets from the Document Understanding Conferences (DUC), as measured by ROUGE, the best-known automatic metric for evaluating summarization systems. Although the DUC data and evaluations were originally designed to test multidocument summarization, we developed a framework that extends them to evaluating each of the three components: querying, clustering, and summarization. Under this framework, we then demonstrate that the end-to-end QCS system achieves performance as good as or better than the best summarization engines.

Given a query, QCS retrieves relevant documents, separates the retrieved documents into topic clusters, and creates a single summary for each cluster. In the current implementation, Latent Semantic Indexing is used for retrieval, generalized spherical k-means is used for the document clustering, and a method coupling sentence “trimming” and a hidden Markov model, followed by a pivoted QR decomposition, is used to create a single extract summary for each cluster. The user interface is designed to provide access to detailed information in a compact and useful format.

Our system demonstrates the feasibility of assembling an effective IR system from existing software libraries, the usefulness of the modularity of the design, and the value of this particular combination of modules.

Introduction

Information retrieval (IR) systems provide users with a vast amount of reference material. Along with this tremendous access comes the challenge of effectively presenting the user with relevant information in response to a query. When using an IR engine to search through electronic resources, simple queries often return too many documents, including many that are not relevant to the intended search. For instance, there are several million documents on the World Wide Web pertaining to “Michael Jordan.” Most of these concern the basketball star, so it is difficult to find information about the television personality, the jazz musician, the mathematician, or the many others who share that name. It would be useful to have a system that could overcome this limitation.

One approach is to cluster the documents after retrieval and present a synopsis of each cluster so that a user can choose clusters of interest. This is the motivation for our Query, Cluster, Summarize (QCS) system, which performs the following tasks in response to a query:

  • retrieves relevant documents,

  • separates the retrieved documents into clusters by topic, and

  • creates a summary for each cluster.

Our implementation of the QCS system partitions the code into portable modules, making it easy to experiment with different methods for handling the three main tasks listed above. In our current implementation, we use existing software libraries for each task. Throughout this paper, we discuss our choice for each module, noting that any module can be replaced by an alternative method. Our goal in this work is to produce short summaries (∼100 words) for clusters of documents organized by topic and listed in descending order of relevance to a user’s query.
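To illustrate the modularity claim, the following toy skeleton shows how the three stages might plug together behind minimal interfaces. Every name here (Retriever, Clusterer, Summarizer, the placeholder classes, qcs_style_pipeline) is ours, invented for illustration; QCS's actual modules are the LSI retrieval, spherical k-means, and HMM/pivoted-QR codes described in this paper, and the placeholders below deliberately do something much simpler.

```python
from typing import Dict, List, Protocol

# Hypothetical stage interfaces; any conforming implementation can be
# swapped in without touching the rest of the pipeline.
class Retriever(Protocol):
    def retrieve(self, query: str, docs: List[str]) -> List[int]: ...

class Clusterer(Protocol):
    def cluster(self, ids: List[int], docs: List[str]) -> List[List[int]]: ...

class Summarizer(Protocol):
    def summarize(self, ids: List[int], docs: List[str]) -> str: ...

class TermMatchRetriever:
    """Placeholder: keep any document sharing a term with the query.
    (QCS itself uses Latent Semantic Indexing for this stage.)"""
    def retrieve(self, query, docs):
        q = set(query.lower().split())
        return [i for i, d in enumerate(docs) if q & set(d.lower().split())]

class FirstWordClusterer:
    """Placeholder: group documents by their first word.
    (QCS uses generalized spherical k-means.)"""
    def cluster(self, ids, docs):
        groups: Dict[str, List[int]] = {}
        for i in ids:
            groups.setdefault(docs[i].lower().split()[0], []).append(i)
        return list(groups.values())

class LeadDocSummarizer:
    """Placeholder: use a cluster's first document as its summary.
    (QCS uses sentence trimming and an HMM followed by a pivoted QR.)"""
    def summarize(self, ids, docs):
        return docs[ids[0]]

def qcs_style_pipeline(query: str, docs: List[str],
                       r: Retriever, c: Clusterer, s: Summarizer):
    ids = r.retrieve(query, docs)
    return [(cl, s.summarize(cl, docs)) for cl in c.cluster(ids, docs)]

docs = [
    "hurricane winds hit the coast",
    "hurricane relief efforts continue",
    "earthquake shakes the valley",
    "stock markets rise again",
]
results = qcs_style_pipeline("hurricane earthquake", docs,
                             TermMatchRetriever(), FirstWordClusterer(),
                             LeadDocSummarizer())
for cluster, summary in results:
    print(cluster, "->", summary)
```

Because each stage sees only the interface of its neighbors, swapping, say, the retriever for an LSI-based one requires no change to the clustering or summarization code, which is the property the QCS design exploits.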

Many existing retrieval systems focus on organizing individual documents related to a query by using one or more ranking algorithms to order the retrieved documents. For example, Google uses its hyperlink analysis algorithm, PageRank, to order documents retrieved from web-based collections (Google, 2006). The combination of link analysis and text-based topic analysis employed by the ExpertRank (Teoma) algorithm in the Ask search engine (Search & Media, 2006) results in implicit topic clustering. In such systems, short extractions of the documents containing one or more of the query terms are displayed to users.

Examples of retrieval systems employing clustering algorithms for organizing sets of retrieved documents include Velocity/Clusty (Vivisimo, 2006), Infonetware/RealTerm (Infogistics, 2001), WiseNut (LookSmart, 2006), Accumo (Accumo, 2006), iBoogie (CyberTavern, 2006), and the KartOO and Ujiko systems (KartOO, 2006). These systems organize the documents into clusters and generate a list of keywords associated with each cluster. The latter two systems also present graphical representations of the resulting clusters. As with the retrieval systems above, these systems also present document extractions containing one or more query terms; the only summary presented is a list of keywords.

Much of the recent research on automatic text summarization systems is available in the proceedings of the 2001–2006 Document Understanding Conferences. The focus of these conferences is the evaluation and discussion of summarization algorithms and systems performing sets of tasks on several types of document collections. Several tasks in previous DUC evaluations focused on multidocument summarization for clusters of documents, and our participation in these evaluations led to the development of QCS.

Previous work on using a combination of clustering and summarization to improve IR is summarized in Radev, Fan, and Zhang (2001). Of existing IR systems employing this combination, QCS most resembles the NewsInEssence system (Radev, Blair-Goldensohn, Zhang, & Raghavan, 2001) in that both systems can produce multidocument summaries from document sets clustered by topic. However, NewsInEssence is designed for IR from HTML-linked document sets while QCS has been designed for IR from generic document sets.

Another system that leverages clustering and summarization for information organization similarly to QCS is the Columbia Newsblaster system (McKeown et al., 2002). Newsblaster, like NewsInEssence, is a web-based system that crawls news websites and then clusters and summarizes the news stories, but it does not currently accept queries. Recently, the value of summarization to users in IR has been demonstrated by Maña-López, de Buenaga, and Gómez-Hidalgo (2004), whose study showed increases in user recall of retrieved information when clustering and summarization were included in the output of the IR system.

We have used QCS for information retrieval in two information domains: biomedical abstracts from the US National Library of Medicine’s MEDLINE database, discussed in Dunlavy, Conroy, O’Leary, and O’Leary (2003), and newswire documents from the 2002–2004 DUC evaluations, discussed here.

In Section 2, we discuss our choices for each of the components of the QCS system. An example of use of the QCS system is presented in Section 3. Section 4 presents results of experiments evaluating some of the components of the implementation, and we conclude in Section 5.


The QCS system

QCS is a collection of software modules developed in the languages C and C++ and tested under the operating systems SunOS 5.8 (Solaris 8) and Linux (kernel v2.4). Preprocessing tools for all QCS data, including processing of the data passed from one module to another, were developed in the Perl language. QCS has been developed as a client-server application, and the implementation took approximately 6 person-months of full-time effort.

In this section we describe the components of our system:

Example of the QCS system

We present an example of the entire QCS system from the standpoint of the user. The example uses the query “hurricane earthquake” to find documents in the DUC 2002 document collection. The DUC 2002 collection consists of 567 documents, and the QCS preprocessing modules identified 7767 unique terms across the collection.
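The rank-reduced query scoring behind such an example can be sketched in miniature. The six-term, four-document count matrix below is invented (it stands in for the DUC collection, and rank 2 stands in for a higher-rank model); the mechanics, a truncated SVD followed by folding the query into the latent space and ranking documents by cosine similarity, are the standard LSI recipe.

```python
import numpy as np

# Invented term-document count matrix: terms are rows, documents columns.
# Docs 0-1 are storm-themed, docs 2-3 earthquake-themed.
terms = ["hurricane", "wind", "storm", "earthquake", "quake", "fault"]
A = np.array([
    [2, 1, 0, 0],   # hurricane
    [1, 2, 0, 0],   # wind
    [1, 1, 0, 1],   # storm
    [0, 0, 2, 1],   # earthquake
    [0, 0, 1, 2],   # quake
    [0, 0, 1, 1],   # fault
], dtype=float)

# LSI: rank-k truncated SVD of the term-document matrix.
k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vtk = U[:, :k], S[:k], Vt[:k, :]

# Fold the query into the latent space: q_hat = Sk^{-1} Uk^T q.
q = np.zeros(len(terms))
q[terms.index("hurricane")] = 1.0
q_hat = np.diag(1.0 / Sk) @ Uk.T @ q

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Document j lives at column j of Vtk; rank documents by cosine similarity.
scores = [cos(q_hat, Vtk[:, j]) for j in range(Vtk.shape[1])]
ranking = np.argsort(scores)[::-1]
print("scores:", [round(s, 3) for s in scores])
print("ranking:", ranking.tolist())
```

On this toy data the two storm documents rank above the two earthquake documents for the query "hurricane", even though the second storm document never contains the query term; that co-occurrence effect is the point of the rank reduction.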

Table 1 shows the highest query similarity scores along with the first “subject” sentence (i.e., first sentence with stype = 0) from each document. In this example, a rank-50

Experiments

In this section, we describe the results of two sets of experiments performed to test QCS on various document collections. We first present the data used in the experiments, then describe our framework for evaluating the performance of QCS, and finally present the results of several tests performed within this framework. Tests were performed on a Sun Ultra60 with a 450 MHz processor and 512 MB of RAM running Solaris 8. Further results for testing QCS, including timings and additional retrieval

Conclusions

QCS is a tool for document retrieval that presents results in a format that lets a user quickly identify a set of documents of interest. The results include a multidocument summary of each cluster of documents, a summary of each individual document, a pointer to each document, and pointers to the documents from which the multidocument extract summary was derived. Results of using QCS on the DUC document set illustrate the usefulness of this system; in particular, we provide evidence of the value

Acknowledgements

We thank the authors of LT TTT, GTP, and GMEANS for the use of their code, Timothy O’Leary for his assistance with the MEDLINE data set used in QCS, and Tamara Kolda for her helpful suggestions during the preparation of this manuscript. Daniel Dunlavy was supported in part by the Applied Mathematics Research program of the Office of Advanced Scientific Computing Research of DOE’s Office of Science and by Sandia National Laboratories, a multiprogram laboratory operated by Sandia Corporation, a


References (32)

  • Accumo (2006). Accumo – enterprise text categorization – technology. Retrieved December 4, 2006, from...
  • Baum, L. E., et al. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics.
  • Conroy, J. M., & O’Leary, D. P. (2001). Text summarization via hidden Markov models and pivoted QR matrix...
  • Conroy, J. M., Schlesinger, J. D., Goldstein, J., & O’Leary, D. P. (2004). Left-brain/right-brain multi-document...
  • CyberTavern, LLC (2006). iBoogie – metasearch document clustering engine and personalized search engines directory....
  • Dang, H. T. (2005). Overview of DUC 2005. In Proc. document understanding...
  • Deerwester, S. C., et al. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science.
  • Dhillon, I. S., et al. Efficient clustering of very large document collections.
  • Dhillon, I. S., Guan, Y., & Kogan, J. (2002). Iterative clustering of high dimensional text data augmented by local...
  • Dhillon, I. S., et al. (2001). Concept decompositions for large sparse text data using clustering. Machine Learning.
  • Dunlavy, D. M., Conroy, J. M., O’Leary, T. J., & O’Leary, D. P. (2003). Clustering and summarizing Medline abstracts....
  • Dunlavy, D. M., Conroy, J. M., Schlesinger, J. D., Goodman, S. A., Okurowski, M. E., O’Leary, D. P., et al. (2003)....
  • Dunlavy, D. M., O’Leary, D. P., Conroy, J. M., & Schlesinger, J. D. (2006). QCS: a system for querying, clustering and...
  • Dunning, T. (1993). Accurate methods for statistics of surprise and coincidence. Computational Linguistics.
  • Francis, W. N., et al. (1982). Frequency analysis of English usage: Lexicon and grammar.
  • Giles, J. T., et al. GTP (general text parser) software for text mining.

Daniel M. Dunlavy is a John von Neumann Fellow at Sandia National Laboratories. His research interests include application of numerical linear algebra and optimization methods in the areas of informatics and biological and engineering sciences. He received his BA in Computer Studies from Northwestern University, an MA in Applied Mathematics from Western Michigan University, and an MS and PhD in Applied Mathematics and Scientific Computing from the University of Maryland. He is a member of the Society for Industrial and Applied Mathematics (SIAM).

Dianne P. O’Leary is a professor in the Computer Science Department and Institute for Advanced Computer Studies at the University of Maryland. She received her BS in mathematics from Purdue University and her PhD in computer science from Stanford. She has authored over 75 journal articles on computational linear algebra and optimization, algorithms for high-performance computers, numerical solution of ill-posed problems, and scientific computing. She is a fellow of the ACM and a member of SIAM and the Association for Women in Mathematics.

John M. Conroy is a research staff member for the IDA Center for Computing Sciences in Bowie, MD. His research interest is applications of numerical linear algebra. He has published papers in high performance computing, pattern classification, anomaly detection, and text summarization. He received his B.S. in Mathematics from Saint Joseph’s University, and his PhD from the Applied Mathematics Program at the University of Maryland. He is a member of SIAM, IEEE, and the Association for Computational Linguistics (ACL).

Judith D. Schlesinger is a research staff member at the IDA Center for Computing Sciences in Bowie, MD. Her research interests span programming and natural languages, and she has published papers on intelligent tutoring systems, parallel programming, program understanding, and automatic summarization. She received her BS in mathematics from Brooklyn College, CUNY/SUNY, her MS in computer and information science from The Ohio State University, and her PhD in computer science from The Johns Hopkins University. She is a member of the ACM, IEEE Computer Society, and ACL.
