Applying regression models to query-focused multi-document summarization

https://doi.org/10.1016/j.ipm.2010.03.005

Abstract

Most existing research on applying machine learning techniques to document summarization explores either classification models or learning-to-rank models. This paper presents our recent study on how to apply a different kind of learning model, namely regression models, to query-focused multi-document summarization. We choose Support Vector Regression (SVR) to estimate the importance of a sentence in the document set to be summarized through a set of pre-defined features. To learn the regression models, we propose several methods for constructing "pseudo" training data by assigning each sentence a "nearly true" importance score calculated from the human summaries provided for the corresponding document set. A series of evaluations on the DUC data sets is conducted to examine the efficiency and robustness of the proposed approaches. When compared with classification models and ranking models, regression models are consistently preferable.

Introduction

Document summarization techniques are one way of helping people find information effectively and efficiently. There are two main approaches to automatic summarization: abstractive and extractive. Abstractive approaches promise to produce summaries closer to what a human might write, but they are limited by the current state of natural language understanding and generation. The more widely used extractive approaches rank sentences by importance, extract the salient ones, and then compose the summary from them. Sentence ranking has for some time been carried out using machine learning techniques. Early techniques usually treated sentence ranking as a binary classification problem: classification models were learned from sets of "important" and "unimportant" sentences and then used to identify the "key" sentences. Subsequent learning-to-rank approaches normally learned from ordered sentence pairs or lists, which are easier to obtain than the training data required by classification models. In either case, most classification and ranking models transform binary or ordering information into continuous values, which are then used for classification or ranking.
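The extractive pipeline described above can be sketched in a few lines: score every sentence, then take the top-k as the summary. The scoring function here (query-term overlap) is only a toy placeholder standing in for the learned models the paper discusses, and all names and data are illustrative assumptions.

```python
# Minimal sketch of a generic extractive pipeline: score sentences with a
# stand-in importance function, then keep the top-k as the summary.
# The query-overlap score below is a placeholder, not the paper's model.

def score_sentence(sentence, query):
    """Toy importance score: fraction of query terms present in the sentence."""
    q_terms = set(query.lower().split())
    s_terms = set(sentence.lower().split())
    return len(q_terms & s_terms) / max(len(q_terms), 1)

def extractive_summary(sentences, query, k=2):
    """Rank sentences by score (ties keep document order) and return the top k."""
    ranked = sorted(sentences, key=lambda s: score_sentence(s, query), reverse=True)
    return ranked[:k]

docs = [
    "Regression models estimate sentence importance directly.",
    "The weather was pleasant that day.",
    "Sentence importance drives extractive summarization.",
]
print(extractive_summary(docs, "sentence importance", k=2))
```

In a learned system the placeholder scorer would be replaced by a function fitted to training data, but the rank-and-select skeleton stays the same.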

An alternative to these approaches is offered by regression models, which learn continuous functions that directly estimate the importance of sentences; sentence importance is better characterized as continuous than discrete. A further advantage of regression models is that their training data carries continuous importance values, unlike classification or ranking models, which use discrete sentence labels or ranked sentence pairs. The learned regression functions should therefore estimate sentence importance more accurately, provided that training data of sufficient quality and quantity can be obtained, whether manually or automatically.

In this paper, we study how to apply regression models to the sentence ranking problem in query-focused multi-document summarization. We implement the regression models with Support Vector Regression (SVR), the regression counterpart of Support Vector Machines (Vapnik, 1995), which is capable of building state-of-the-art approximation functions. To obtain training data, we construct "pseudo" training data automatically from human summaries and their corresponding document sets, developing and comparing several N-gram based methods that estimate "nearly true" sentence importance scores. The training data is then used to learn a mapping from a set of pre-defined sentence features to these "nearly true" importance scores, and the learned function predicts the importance of the sentences in the test data. We carry out a series of experiments to evaluate the efficiency and robustness of our approaches.
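One plausible variant of an N-gram based "nearly true" score is the fraction of a sentence's bigrams that also occur in any human summary, a ROUGE-style precision. The paper compares several such methods; the sketch below is an illustrative assumption, not its exact formulation, and the example sentences are invented.

```python
# A ROUGE-precision-style "nearly true" importance score: the fraction of a
# sentence's bigrams that also appear in some human reference summary.
# Illustrative only; the paper proposes and compares several such methods.

def ngrams(tokens, n=2):
    """Set of n-grams (as tuples) of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pseudo_score(sentence, human_summaries, n=2):
    """Share of the sentence's n-grams covered by the reference summaries."""
    s_grams = ngrams(sentence.lower().split(), n)
    if not s_grams:
        return 0.0
    ref_grams = set()
    for summ in human_summaries:
        ref_grams |= ngrams(summ.lower().split(), n)
    return len(s_grams & ref_grams) / len(s_grams)

refs = ["regression models estimate sentence importance"]
print(pseudo_score("regression models rank sentences", refs))
```

Scores computed this way are continuous in [0, 1], which is exactly the kind of target a regression model can be trained against.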

The remainder of the paper is organized as follows. Section 2 briefly introduces the related work. Section 3 explains the proposed approach. Section 4 presents experiments, evaluations and discussions. Section 5 concludes the paper.

Related work

The application of machine learning techniques to document summarization has a long history. Kupiec, Pedersen, and Chen (1995) first proposed a trainable summarization approach that adopted word-based features and used a naïve Bayes classifier to learn feature weights from a set of given summaries. The learning-based system that combined all the features performed better than any system using a single feature. Many early studies followed this idea and extended Kupiec et

Regression models for query-focused multi-document summarization

Our summarization approach is built upon the typical feature-based extractive framework, which ranks and extracts sentences according to a set of pre-defined sentence features and a composite scoring function. The learning models in our feature-based approach thus search for an optimum composite scoring function with the fixed feature set.
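The search for a composite scoring function can be sketched with Support Vector Regression, here via scikit-learn's SVR as a stand-in implementation. The three features (sentence position, length, query-term overlap) and all numeric values are hypothetical examples, not the paper's actual feature set or data.

```python
# Sketch of learning a composite scoring function with Support Vector
# Regression. Features and scores are invented for illustration.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Each row: [normalized sentence position, token count, query-term overlap]
X_train = np.array([
    [0.0, 25, 0.8],
    [0.5, 12, 0.1],
    [0.9, 30, 0.4],
    [0.2, 18, 0.6],
])
# "Nearly true" importance scores, as if derived from human summaries
y_train = np.array([0.9, 0.1, 0.4, 0.7])

# Standardize features so no single feature dominates the RBF kernel
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.05))
model.fit(X_train, y_train)

# At test time, rank unseen sentences by their predicted importance
X_test = np.array([[0.1, 22, 0.7], [0.8, 10, 0.0]])
scores = model.predict(X_test)
order = np.argsort(-scores)  # indices of sentences, highest score first
print(scores, order)
```

The learned function replaces a hand-tuned weighted sum of features: ranking then reduces to sorting sentences by the predicted score.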

Experiment set-up

We conduct a series of experiments on the query-focused multi-document summarization task initiated by DUC in 2005. The task requires creating a brief, well-organized, and fluent summary from a set of documents related to a given query. This task has been specified as the main evaluation task over 3 years (2005–2007) and thus provides a good benchmark for researchers to exchange their ideas and experiences in this area. Each year, DUC assessors develop a total of about 50 DUC topics. Each topic

Conclusion and future work

This paper has presented our study of how to develop a regression-style sentence ranking scheme for query-focused multi-document summarization. We examined different methods for constructing training data from human summaries, reported what we have learned about constructing good training data, and compared the effectiveness of different learning models for sentence ranking. Our experiments have shown that regression models are to be preferred over classification models or

Acknowledgements

The work described in this paper was partially supported by Hong Kong RGC Projects (PolyU5211/05E and PolyU5217/07E), NSFC programs (60603093 and 60875042) and 973 National Basic Research Program of China (2004CB318102).

You Ouyang is currently a Ph.D. student in the Department of Computing, the Hong Kong Polytechnic University, Hong Kong. He received the B.Sc. and M.Sc. degrees from Peking University, Beijing, China, in 2004 and 2007, respectively. His main research interests include statistical natural language processing, text mining and data mining.

References (24)

  • Amini, M. R., & Usunier, N. (2009). Incorporating prior knowledge into a transductive ranking algorithm for...
  • Amini, M. R., et al. Automatic text summarization based on word-clusters and ranking algorithms.
  • Banerjee, S., & Pedersen, T. (2002). An adapted Lesk algorithm for word sense disambiguation using WordNet. In...
  • Carbonell, J. G., & Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and...
  • Chuang, W. T., & Yang, J. (2000). Extracting sentence segments for text summarization: A machine learning approach. In...
  • Conroy, J. M., Schlesinger, J. D., & O’Leary, D. P. (2006). Topic-focused multi-document summarization using an...
  • Cunningham, H., Maynard, D., Bontcheva, K., & Tablan, V. (2002). GATE: A framework and graphical development environment...
  • Dang, H. T. (2005). Overview of DUC 2005. In Document understanding conference 2005....
  • Fisher, S., & Roark, B. (2006). Query-focused summarization by supervised sentence ranking and skewed word...
  • Hirao, T., & Isozaki, H. (2002). Extracting important sentences with support vector machines. In Proceedings of the...
  • Joachims, T. Making large-scale SVM learning practical.
  • Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of the ACM conference on...

    Wenjie Li is currently an assistant professor in the Department of Computing, the Hong Kong Polytechnic University, Hong Kong. She received her Ph.D. degree from the Department of Systems Engineering and Engineering Management at the Chinese University of Hong Kong, Hong Kong, in 1997. Her main research topics include natural language processing, information extraction and temporal information processing.

    Sujian Li is currently an assistant professor in the Key Laboratory of Computational Linguistics, Peking University, China. Her main research topics include information extraction, automatic indexing and computational linguistics.

    Qin Lu is currently a professor and associate head of the Department of Computing, the Hong Kong Polytechnic University, Hong Kong. Her research has been on open systems especially in interoperability and internationalization, Chinese computing and natural language processing. She is currently the rapporteur of the ISO/IEC/JTC1/SC2/WG2’s ideographic Rapporteur Group for the standardization of ideograph characters in the ISO/IEC 10646 standard.
