QCS: A system for querying, clustering and summarizing documents
Introduction
Information retrieval (IR) systems provide users with a vast amount of reference material. Along with this tremendous access comes the challenge of effectively presenting the user with relevant information in response to a query. When using an IR engine to search through electronic resources, simple queries often return too many documents, including many that are not relevant to the intended search. For instance, there are several million documents on the World Wide Web pertaining to “Michael Jordan.” Most of these concern the basketball star, so it is difficult to find information about the television personality, the jazz musician, the mathematician, or the many others who share that name. It would be useful to have a system that could overcome this limitation.
One approach is to cluster the documents after retrieval and present a synopsis of each cluster so that a user can choose clusters of interest. This is the motivation for our Query, Cluster, Summarize (QCS) system, which performs the following tasks in response to a query:
- retrieves relevant documents,
- separates the retrieved documents into clusters by topic, and
- creates a summary for each cluster.
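The three-stage flow above can be sketched as a simple pipeline. The function names and the toy retrieval, clustering, and summarization heuristics below are illustrative placeholders only, not the actual QCS modules (which are described in Section 2):

```python
# Illustrative sketch of the query -> cluster -> summarize pipeline.
# All names and heuristics here are hypothetical stand-ins; QCS itself
# uses dedicated IR, clustering, and summarization components.

def retrieve(query, documents):
    """Return documents sharing at least one term with the query."""
    terms = set(query.lower().split())
    return [d for d in documents if terms & set(d.lower().split())]

def cluster(documents):
    """Group documents by their most frequent term (a toy topic proxy)."""
    groups = {}
    for d in documents:
        words = d.lower().split()
        topic = max(set(words), key=words.count)
        groups.setdefault(topic, []).append(d)
    return list(groups.values())

def summarize(docs):
    """Use the first sentence of the first document as a stand-in summary."""
    return docs[0].split(".")[0] + "."

def qcs(query, documents):
    """Run the three stages in sequence: one summary per cluster."""
    return [summarize(c) for c in cluster(retrieve(query, documents))]
```

Because each stage is a separate function with a simple interface, any one of them can be swapped out independently, which mirrors the modular design discussed below.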
Our implementation of the QCS system partitions the code into portable modules, making it easy to experiment with different methods for handling the three main tasks listed above. In our current implementation of the QCS system, we use existing software libraries for each task. Throughout this paper, we discuss our choices for each of the modules used, but note that it is possible to exchange individual modules with other methods. Our goal in this work is to produce short summaries (∼100 words) for clusters of documents organized by topic and listed in descending order of relevance to a user’s query.
Many existing retrieval systems focus on organizing individual documents related to a query by using one or more ranking algorithms to order the retrieved documents. For example, Google uses its hyperlink analysis algorithm, PageRank, to order documents retrieved from web-based collections (Google, 2006). The combination of link analysis and text-based topic analysis employed by the ExpertRank (Teoma) algorithm in the Ask search engine (IAC Search & Media, 2006) results in implicit topic clustering. In such systems, short extractions of the documents containing one or more of the query terms are displayed to users.
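For reference, the core of a link-analysis ranker such as PageRank is a power iteration on a normalized link matrix. The following is a minimal sketch on a toy three-page graph with the commonly cited damping factor of 0.85; it is a schematic of the idea, not Google's production algorithm:

```python
# Minimal PageRank power iteration on a toy link graph.
# The graph, damping factor, and iteration count are illustrative;
# real implementations handle sparsity and convergence far more carefully.

def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start from a uniform distribution
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:                         # dangling node: spread rank uniformly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:                                # split rank evenly among out-links
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

In this toy graph, page "c" is linked to by both "a" and "b", so it ends up with the highest rank; the ranks always sum to one, since the iteration redistributes a fixed amount of total rank.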
Examples of retrieval systems employing clustering algorithms for organizing sets of retrieved documents include Velocity/Clusty (Vivisimo, 2006), Infonetware/RealTerm (Infogistics, 2001), WiseNut (LookSmart, 2006), Accumo (Accumo, 2006), iBoogie (CyberTavern, 2006), and the KartOO and Ujiko systems (KartOO, 2006). These systems organize the documents into clusters and generate a list of keywords associated with each cluster. The latter two systems also present graphical representations of the resulting clusters. As with the retrieval systems above, these systems also present document extractions containing one or more query terms; the only summary presented is a list of keywords.
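The cluster-plus-keyword presentation these engines use can be approximated by grouping retrieved documents by term overlap and labeling each group with its most frequent terms. The sketch below (toy corpus, greedy single-pass grouping, no stopword removal) is only a schematic of that idea, not any of the named systems' algorithms:

```python
# Schematic of "cluster, then label with keywords": greedy grouping by
# Jaccard similarity of term sets, then top terms per group. The corpus,
# threshold, and scoring are toy choices for illustration only.
from collections import Counter

def terms(doc):
    return set(doc.lower().split())

def group(docs, threshold=0.2):
    """Assign each document to the first cluster whose representative
    (first member) it resembles; otherwise start a new cluster."""
    clusters = []
    for d in docs:
        for c in clusters:
            rep, t = terms(c[0]), terms(d)
            if len(rep & t) / len(rep | t) >= threshold:
                c.append(d)
                break
        else:
            clusters.append([d])
    return clusters

def keywords(cluster_docs, k=3):
    """Label a cluster with its k most frequent terms (no stopword filtering)."""
    counts = Counter(w for d in cluster_docs for w in d.lower().split())
    return [w for w, _ in counts.most_common(k)]
```

Note that such keyword labels are the only "summary" these systems present; QCS instead generates sentence-level extract summaries for each cluster.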
Much of the recent research on automatic text summarization systems is available in the proceedings of the 2001–2006 Document Understanding Conferences (DUC). The focus of these conferences is the evaluation and discussion of summarization algorithms and systems performing sets of tasks on several types of document collections. Several tasks in previous DUC evaluations focused on multidocument summarization for clusters of documents, and our participation in these evaluations led to the development of QCS.
Previous work on using a combination of clustering and summarization to improve IR is summarized in Radev, Fan, and Zhang (2001). Of existing IR systems employing this combination, QCS most resembles the NewsInEssence system (Radev, Blair-Goldensohn, Zhang, & Raghavan, 2001) in that both systems can produce multidocument summaries from document sets clustered by topic. However, NewsInEssence is designed for IR from HTML-linked document sets while QCS has been designed for IR from generic document sets.
Another system that leverages clustering and summarization for information organization similarly to QCS is the Columbia Newsblaster system (McKeown et al., 2002). Newsblaster, like NewsInEssence, is a web-based system that crawls news websites and then clusters and summarizes the news stories, but it does not currently accept queries. Recently, the value of summarization to users in IR has been demonstrated by Maña-López, de Buenaga, and Gómez-Hidalgo (2004), whose study showed increases in user recall of retrieved information when clustering and summarization were included in the output of the IR system.
We have used QCS for information retrieval in two information domains: biomedical abstracts from the US National Library of Medicine’s MEDLINE database, discussed in Dunlavy, Conroy, O’Leary, and O’Leary (2003), and newswire documents from the 2002–2004 DUC evaluations, discussed here.
In Section 2, we discuss our choices for each of the components of the QCS system. An example of use of the QCS system is presented in Section 3. Section 4 presents results of experiments evaluating some of the components of the implementation, and we conclude in Section 5.
The QCS system
QCS is a collection of software modules developed in the languages C and C++ and tested under the operating systems SunOS 5.8 (Solaris 8) and Linux (kernel v2.4). Preprocessing tools for all QCS data, including processing of the data passed from one module to another, were developed in the Perl language. QCS has been developed as a client-server application, and the implementation took approximately 6 person-months of full-time effort.
In this section, we describe the components of our system.
Example of the QCS system
We present an example of the entire QCS system from the standpoint of the user. The example uses the query hurricane earthquake to find documents in the DUC 2002 document collection. The DUC 2002 collection consists of 567 documents, and the QCS preprocessing modules identified 7767 unique terms across the collection.
Table 1 shows the highest query similarity scores along with the first “subject” sentence (i.e., first sentence with stype = 0) from each document. In this example, a rank-50
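Query similarity scores of this kind come from comparing a query vector to document vectors in term space, with QCS using a reduced-rank (here rank-50) latent semantic approximation of the term-document matrix. The sketch below keeps the scoring idea visible using raw term counts and cosine similarity; the SVD-based rank reduction and QCS's exact weighting are omitted:

```python
# Toy sketch of query-document scoring by cosine similarity over term
# vectors. QCS scores queries against a low-rank latent semantic
# approximation of the term-document matrix; this simplified version
# uses raw term counts so the core computation stays visible.
import math
from collections import Counter

def vector(text):
    """Term-frequency vector of a text (no weighting or stemming)."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def score(query, docs):
    """Return (similarity, document) pairs, highest similarity first."""
    q = vector(query)
    return sorted(((cosine(q, vector(d)), d) for d in docs), reverse=True)
```

Documents sharing no terms with the query score zero, while documents are otherwise ranked by the angle between their term vectors and the query's.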
Experiments
In this section, we describe the results of two sets of experiments performed to test QCS on various document collections. We first present the data used in the experiments, then describe our framework for evaluating the performance of QCS, and finally present the results of several tests performed within this framework. Tests were performed on a Sun Ultra60 with a 450 MHz processor and 512 MB of RAM running Solaris 8. Further results for testing QCS, including timings and additional retrieval
Conclusions
QCS is a tool for document retrieval that presents results in a format so that a user can quickly identify a set of documents of interest. The results include a multidocument summary of each cluster of documents, a summary of each individual document, a pointer to each document, and pointers to documents from which the multidocument extract summary was derived. Results of using QCS on the DUC document set illustrate the usefulness of this system; in particular, we provide evidence of the value
Acknowledgements
We thank the authors of LT TTT, GTP, and GMEANS for the use of their code, Timothy O’Leary for his assistance with the MEDLINE data set used in QCS, and Tamara Kolda for her helpful suggestions during the preparation of this manuscript. Daniel Dunlavy was supported in part by the Applied Mathematics Research program of the Office of Advanced Scientific Computing Research of DOE’s Office of Science and by Sandia National Laboratories, a multiprogram laboratory operated by Sandia Corporation, a
References (32)
- Accumo (2006). Accumo – enterprise text categorization – technology. Retrieved December 4, 2006, from...
- Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics.
- Conroy, J. M., & O’Leary, D. P. (2001). Text summarization via hidden Markov models and pivoted QR matrix...
- Conroy, J. M., Schlesinger, J. D., Goldstein, J., & O’Leary, D. P. (2004). Left-brain/right-brain multi-document...
- CyberTavern, LLC (2006). iBoogie – metasearch document clustering engine and personalized search engines directory....
- Dang, H. T. (2005). Overview of DUC 2005. In Proc. document understanding...
- Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science.
- Dhillon, I. S., Fan, J., & Guan, Y. Efficient clustering of very large document collections.
- Dhillon, I. S., Guan, Y., & Kogan, J. (2002). Iterative clustering of high dimensional text data augmented by local...
- Dhillon, I. S., & Modha, D. S. (2001). Concept decompositions for large sparse text data using clustering. Machine Learning.
- Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics.
- Francis, W. N., & Kučera, H. (1982). Frequency analysis of English usage: Lexicon and grammar.
- Giles, J. T., Wo, L., & Berry, M. W. GTP (General Text Parser) software for text mining.
Daniel M. Dunlavy is a John von Neumann Fellow at Sandia National Laboratories. His research interests include application of numerical linear algebra and optimization methods in the areas of informatics and biological and engineering sciences. He received his BA in Computer Studies from Northwestern University, an MA in Applied Mathematics from Western Michigan University, and an MS and PhD in Applied Mathematics and Scientific Computing from the University of Maryland. He is a member of the Society for Industrial and Applied Mathematics (SIAM).
Dianne P. O’Leary is a professor in the Computer Science Department and Institute for Advanced Computer Studies at the University of Maryland. She received her BS in mathematics from Purdue University and her PhD in computer science from Stanford. She has authored over 75 journal articles on computational linear algebra and optimization, algorithms for high-performance computers, numerical solution of ill-posed problems, and scientific computing. She is a fellow of the ACM and a member of SIAM and the Association for Women in Mathematics.
John M. Conroy is a research staff member for the IDA Center for Computing Sciences in Bowie, MD. His research interest is applications of numerical linear algebra. He has published papers in high performance computing, pattern classification, anomaly detection, and text summarization. He received his B.S. in Mathematics from Saint Joseph’s University, and his PhD from the Applied Mathematics Program at the University of Maryland. He is a member of SIAM, IEEE, and the Association for Computational Linguistics (ACL).
Judith D. Schlesinger is a research staff member at the IDA Center for Computing Sciences in Bowie, MD. Her research interests span programming and natural languages, and she has published papers on intelligent tutoring systems, parallel programming, program understanding, and automatic summarization. She received her BS in mathematics from Brooklyn College, CUNY/SUNY, her MS in computer and information science from The Ohio State University, and her PhD in computer science from The Johns Hopkins University. She is a member of the ACM, IEEE Computer Society, and ACL.