QCS: A system for querying, clustering and summarizing documents

https://doi.org/10.1016/j.ipm.2007.01.003

Abstract

Information retrieval systems consist of many complicated components. Research and development of such systems is often hampered by the difficulty in evaluating how each particular component would behave across multiple systems. We present a novel integrated information retrieval system—the Query, Cluster, Summarize (QCS) system—which is portable, modular, and permits experimentation with different instantiations of each of the constituent text analysis components. Most importantly, the combination of the three types of methods in the QCS design improves retrievals by providing users more focused information organized by topic.

We demonstrate the improved performance through a series of experiments using standard test sets from the Document Understanding Conferences (DUC), as measured by ROUGE, the best-known automatic metric for evaluating summarization systems. Although the DUC data and evaluations were originally designed to test multidocument summarization, we developed a framework that extends them to evaluating each of the three components: querying, clustering, and summarization. Under this framework, we then demonstrate that the end-to-end QCS system achieves performance as good as or better than the best summarization engines.

Given a query, QCS retrieves relevant documents, separates the retrieved documents into topic clusters, and creates a single summary for each cluster. In the current implementation, Latent Semantic Indexing is used for retrieval, generalized spherical k-means is used for the document clustering, and a method coupling sentence “trimming” and a hidden Markov model, followed by a pivoted QR decomposition, is used to create a single extract summary for each cluster. The user interface is designed to provide access to detailed information in a compact and useful format.

Our system demonstrates the feasibility of assembling an effective IR system from existing software libraries, the usefulness of the modularity of the design, and the value of this particular combination of modules.

Introduction

Information retrieval (IR) systems provide users with a vast amount of reference material. Along with this tremendous access comes the challenge of effectively presenting the user with relevant information in response to a query. When using an IR engine to search through electronic resources, simple queries often return too many documents, including many that are not relevant to the intended search. For instance, there are several million documents on the World Wide Web pertaining to “Michael Jordan.” Most of these concern the basketball star, so it is difficult to find information about the television personality, the jazz musician, the mathematician, or the many others who share that name. It would be useful to have a system that could overcome this limitation.

One approach is to cluster the documents after retrieval and present a synopsis of each cluster so that a user can choose clusters of interest. This is the motivation for our Query, Cluster, Summarize (QCS) system, which performs the following tasks in response to a query:

  • retrieves relevant documents,

  • separates the retrieved documents into clusters by topic, and

  • creates a summary for each cluster.

Our implementation of the QCS system partitions the code into portable modules, making it easy to experiment with different methods for handling the three main tasks listed above. In our current implementation, we use existing software libraries for each task. Throughout this paper, we discuss our choice for each module, noting that any module can be replaced by an alternative method. Our goal in this work is to produce short summaries (∼100 words) for clusters of documents organized by topic and listed in descending order of relevance to a user’s query.
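To illustrate the modularity claim, the following toy skeleton shows how the three stages might plug together behind minimal interfaces. Every name here (Retriever, Clusterer, Summarizer, the placeholder classes, qcs_style_pipeline) is ours, invented for illustration; QCS's actual modules are the LSI retrieval, spherical k-means, and HMM/pivoted-QR codes described in this paper, and the placeholders below deliberately do something much simpler.

```python
from typing import Dict, List, Protocol

# Hypothetical stage interfaces; any conforming implementation can be
# swapped in without touching the rest of the pipeline.
class Retriever(Protocol):
    def retrieve(self, query: str, docs: List[str]) -> List[int]: ...

class Clusterer(Protocol):
    def cluster(self, ids: List[int], docs: List[str]) -> List[List[int]]: ...

class Summarizer(Protocol):
    def summarize(self, ids: List[int], docs: List[str]) -> str: ...

class TermMatchRetriever:
    """Placeholder: keep any document sharing a term with the query.
    (QCS itself uses Latent Semantic Indexing for this stage.)"""
    def retrieve(self, query, docs):
        q = set(query.lower().split())
        return [i for i, d in enumerate(docs) if q & set(d.lower().split())]

class FirstWordClusterer:
    """Placeholder: group documents by their first word.
    (QCS uses generalized spherical k-means.)"""
    def cluster(self, ids, docs):
        groups: Dict[str, List[int]] = {}
        for i in ids:
            groups.setdefault(docs[i].lower().split()[0], []).append(i)
        return list(groups.values())

class LeadDocSummarizer:
    """Placeholder: use a cluster's first document as its summary.
    (QCS uses sentence trimming and an HMM followed by a pivoted QR.)"""
    def summarize(self, ids, docs):
        return docs[ids[0]]

def qcs_style_pipeline(query: str, docs: List[str],
                       r: Retriever, c: Clusterer, s: Summarizer):
    ids = r.retrieve(query, docs)
    return [(cl, s.summarize(cl, docs)) for cl in c.cluster(ids, docs)]

docs = [
    "hurricane winds hit the coast",
    "hurricane relief efforts continue",
    "earthquake shakes the valley",
    "stock markets rise again",
]
results = qcs_style_pipeline("hurricane earthquake", docs,
                             TermMatchRetriever(), FirstWordClusterer(),
                             LeadDocSummarizer())
for cluster, summary in results:
    print(cluster, "->", summary)
```

Because each stage sees only the interface of its neighbors, swapping, say, the retriever for an LSI-based one requires no change to the clustering or summarization code, which is the property the QCS design exploits.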

Many existing retrieval systems focus on organizing individual documents related to a query by using one or more ranking algorithms to order the retrieved documents. For example, Google uses its hyperlink analysis algorithm, PageRank, to order documents retrieved from web-based collections (Google, 2006). The combination of link analysis and text-based topic analysis employed by the ExpertRank (Teoma) algorithm in the Ask search engine (Search & Media, 2006) results in implicit topic clustering. In such systems, short extractions of the documents containing one or more of the query terms are displayed to users.

Examples of retrieval systems employing clustering algorithms for organizing sets of retrieved documents include Velocity/Clusty (Vivisimo, 2006), Infonetware/RealTerm (Infogistics, 2001), WiseNut (LookSmart, 2006), Accumo (Accumo, 2006), iBoogie (CyberTavern, 2006), and the KartOO and Ujiko systems (KartOO, 2006). These systems organize the documents into clusters and generate a list of keywords associated with each cluster. The latter two systems also present graphical representations of the resulting clusters. As with the retrieval systems above, these systems also present document extractions containing one or more query terms; the only summary presented is a list of keywords.

Much of the recent research on automatic text summarization systems is available in the proceedings of the 2001–2006 Document Understanding Conferences. The focus of these conferences is the evaluation and discussion of summarization algorithms and systems performing sets of tasks on several types of document collections. Several tasks in previous DUC evaluations focused on multidocument summarization for clusters of documents, and our participation in these evaluations led to the development of QCS.

Previous work on using a combination of clustering and summarization to improve IR is summarized in Radev, Fan, and Zhang (2001). Of existing IR systems employing this combination, QCS most resembles the NewsInEssence system (Radev, Blair-Goldensohn, Zhang, & Raghavan, 2001) in that both systems can produce multidocument summaries from document sets clustered by topic. However, NewsInEssence is designed for IR from HTML-linked document sets while QCS has been designed for IR from generic document sets.

Another system that leverages clustering and summarization for information organization similarly to QCS is the Columbia Newsblaster system (McKeown et al., 2002). Newsblaster, like NewsInEssence, is a web-based system that crawls news websites and then clusters and summarizes the news stories, but it does not currently accept queries. Recently, the value of summarization to users in IR has been demonstrated by Maña-López, de Buenaga, and Gómez-Hidalgo (2004), whose study showed increases in user recall of retrieved information when clustering and summarization were included in the output of the IR system.

We have used QCS for information retrieval in two information domains: biomedical abstracts from the US National Library of Medicine’s MEDLINE database, discussed in Dunlavy, Conroy, O’Leary, and O’Leary (2003), and newswire documents from the 2002–2004 DUC evaluations, discussed here.

In Section 2, we discuss our choices for each of the components of the QCS system. An example of use of the QCS system is presented in Section 3. Section 4 presents results of experiments evaluating some of the components of the implementation, and we conclude in Section 5.


The QCS system

QCS is a collection of software modules developed in the languages C and C++ and tested under the operating systems SunOS 5.8 (Solaris 8) and Linux (kernel v2.4). Preprocessing tools for all QCS data, including processing of the data passed from one module to another, were developed in the Perl language. QCS has been developed as a client-server application, and the implementation took approximately 6 person-months of full-time effort.

In this section we describe the components of our system:

Example of the QCS system

We present an example of the entire QCS system from the standpoint of the user. The example uses the query “hurricane earthquake” to find documents in the DUC 2002 document collection. The DUC 2002 collection consists of 567 documents, and the QCS preprocessing modules identified 7767 unique terms across the collection.
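The rank-reduced query scoring behind such an example can be sketched in miniature. The six-term, four-document count matrix below is invented (it stands in for the DUC collection, and rank 2 stands in for a higher-rank model); the mechanics, a truncated SVD followed by folding the query into the latent space and ranking documents by cosine similarity, are the standard LSI recipe.

```python
import numpy as np

# Invented term-document count matrix: terms are rows, documents columns.
# Docs 0-1 are storm-themed, docs 2-3 earthquake-themed.
terms = ["hurricane", "wind", "storm", "earthquake", "quake", "fault"]
A = np.array([
    [2, 1, 0, 0],   # hurricane
    [1, 2, 0, 0],   # wind
    [1, 1, 0, 1],   # storm
    [0, 0, 2, 1],   # earthquake
    [0, 0, 1, 2],   # quake
    [0, 0, 1, 1],   # fault
], dtype=float)

# LSI: rank-k truncated SVD of the term-document matrix.
k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vtk = U[:, :k], S[:k], Vt[:k, :]

# Fold the query into the latent space: q_hat = Sk^{-1} Uk^T q.
q = np.zeros(len(terms))
q[terms.index("hurricane")] = 1.0
q_hat = np.diag(1.0 / Sk) @ Uk.T @ q

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Document j lives at column j of Vtk; rank documents by cosine similarity.
scores = [cos(q_hat, Vtk[:, j]) for j in range(Vtk.shape[1])]
ranking = np.argsort(scores)[::-1]
print("scores:", [round(s, 3) for s in scores])
print("ranking:", ranking.tolist())
```

On this toy data the two storm documents rank above the two earthquake documents for the query "hurricane", even though the second storm document never contains the query term; that co-occurrence effect is the point of the rank reduction.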

Table 1 shows the highest query similarity scores along with the first “subject” sentence (i.e., first sentence with stype = 0) from each document. In this example, a rank-50

Experiments

In this section, we describe the results of two sets of experiments performed to test QCS on various document collections. We first present the data used in the experiments, then describe our framework for evaluating the performance of QCS, and finally present the results of several tests performed within this framework. Tests were performed on a Sun Ultra60 with a 450 MHz processor and 512 MB of RAM running Solaris 8. Further results for testing QCS, including timings and additional retrieval

Conclusions

QCS is a tool for document retrieval that presents results in a format that lets a user quickly identify a set of documents of interest. The results include a multidocument summary of each cluster of documents, a summary of each individual document, a pointer to each document, and pointers to the documents from which the multidocument extract summary was derived. Results of using QCS on the DUC document set illustrate the usefulness of this system; in particular, we provide evidence of the value

Acknowledgements

We thank the authors of LT TTT, GTP, and GMEANS for the use of their code, Timothy O’Leary for his assistance with the MEDLINE data set used in QCS, and Tamara Kolda for her helpful suggestions during the preparation of this manuscript. Daniel Dunlavy was supported in part by the Applied Mathematics Research program of the Office of Advanced Scientific Computing Research of DOE’s Office of Science and by Sandia National Laboratories, a multiprogram laboratory operated by Sandia Corporation, a


References (32)

  • Accumo (2006). Accumo – enterprise text categorization – technology. Retrieved December 4, 2006, from...
  • Baum, L. E., et al. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics.
  • Conroy, J. M., & O’Leary, D. P. (2001). Text summarization via hidden Markov models and pivoted QR matrix...
  • Conroy, J. M., Schlesinger, J. D., Goldstein, J., & O’Leary, D. P. (2004). Left-brain/right-brain multi-document...
  • CyberTavern, LLC (2006). iBoogie – metasearch document clustering engine and personalized search engines directory....
  • Dang, H. T. (2005). Overview of DUC 2005. In Proc. document understanding...
  • Deerwester, S. C., et al. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science.
  • Dhillon, I. S., et al. Efficient clustering of very large document collections.
  • Dhillon, I. S., Guan, Y., & Kogan, J. (2002). Iterative clustering of high dimensional text data augmented by local...
  • Dhillon, I. S., et al. (2001). Concept decompositions for large sparse text data using clustering. Machine Learning.
  • Dunlavy, D. M., Conroy, J. M., O’Leary, T. J., & O’Leary, D. P. (2003). Clustering and summarizing Medline abstracts....
  • Dunlavy, D. M., Conroy, J. M., Schlesinger, J. D., Goodman, S. A., Okurowski, M. E., O’Leary, D. P., et al. (2003)....
  • Dunlavy, D. M., O’Leary, D. P., Conroy, J. M., & Schlesinger, J. D. (2006). QCS: a system for querying, clustering and...
  • Dunning, T. (1993). Accurate methods for statistics of surprise and coincidence. Computational Linguistics.
  • Francis, W. N., et al. (1982). Frequency analysis of English usage: Lexicon and grammar.
  • Giles, J. T., et al. GTP (general text parser) software for text mining.

Daniel M. Dunlavy is a John von Neumann Fellow at Sandia National Laboratories. His research interests include application of numerical linear algebra and optimization methods in the areas of informatics and biological and engineering sciences. He received his BA in Computer Studies from Northwestern University, an MA in Applied Mathematics from Western Michigan University, and an MS and PhD in Applied Mathematics and Scientific Computing from the University of Maryland. He is a member of the Society for Industrial and Applied Mathematics (SIAM).

Dianne P. O’Leary is a professor in the Computer Science Department and Institute for Advanced Computer Studies at the University of Maryland. She received her BS in mathematics from Purdue University and her PhD in computer science from Stanford. She has authored over 75 journal articles on computational linear algebra and optimization, algorithms for high-performance computers, numerical solution of ill-posed problems, and scientific computing. She is a fellow of the ACM and a member of SIAM and the Association for Women in Mathematics.

John M. Conroy is a research staff member for the IDA Center for Computing Sciences in Bowie, MD. His research interest is applications of numerical linear algebra. He has published papers in high performance computing, pattern classification, anomaly detection, and text summarization. He received his B.S. in Mathematics from Saint Joseph’s University, and his PhD from the Applied Mathematics Program at the University of Maryland. He is a member of SIAM, IEEE, and the Association for Computational Linguistics (ACL).

Judith D. Schlesinger is a research staff member at the IDA Center for Computing Sciences in Bowie, MD. Her research interests span programming and natural languages, and she has published papers on intelligent tutoring systems, parallel programming, program understanding, and automatic summarization. She received her BS in mathematics from Brooklyn College, CUNY/SUNY, her MS in computer and information science from The Ohio State University, and her PhD in computer science from The Johns Hopkins University. She is a member of the ACM, IEEE Computer Society, and ACL.
