Elsevier

Speech Communication

Volume 52, Issue 10, October 2010, Pages 801-815
Long story short – Global unsupervised models for keyphrase based meeting summarization

https://doi.org/10.1016/j.specom.2010.06.002

Abstract

We analyze and compare two different methods for unsupervised extractive spontaneous speech summarization in the meeting domain. Based on utterance comparison, we introduce an optimal formulation for the widely used greedy maximum marginal relevance (MMR) algorithm. Following the idea that information is spread over the utterances in form of concepts, we describe a system which finds an optimal selection of utterances covering as many unique important concepts as possible. Both optimization problems are formulated as an integer linear program (ILP) and solved using public domain software. We analyze and discuss the performance of both approaches using various evaluation setups on two well studied meeting corpora. We conclude with the benefits and drawbacks of the presented models and outline future directions for improving extractive meeting summarization.

Introduction

Wherever people work together, there are (regular) meetings to check on the current status, discuss problems or outline future plans. Recording these get-togethers is a good way of documenting and archiving the progress of a group. This can be done for example by a distant microphone on a table or by integrating a storage device in a tele-conference system. Once acquired, these data can serve several purposes: Non-attendants can go through the meeting to get up to date on group discussions, or participants can check certain points of the agenda in case of uncertainty or lack of notes. However, listening to the whole meeting is tedious and one should be able to directly access the relevant information.

Automatic meeting summarization is one step towards the development of efficient user interfaces for accessing meeting archives. In this work, we study the selection of a concise set of relevant utterances in meeting transcripts generated by automatic speech recognition (ASR). The selected meeting extracts can then either be juxtaposed to form a short text summarizing a meeting or used as a starting point to enhance browsing experience.

Extractive summarization algorithms often rely on the measurement of two important aspects: relevance (selected elements should be important) and non-redundancy (duplicated content should be avoided). These two aspects are usually addressed by computing separate scores and deciding for the best candidates regarding some relevance redundancy trade-off. Summarization algorithms can be categorized as supervised or unsupervised. A supervised system learns how to extract sentences given example documents and respective summaries. An unsupervised system generates a summary while only accessing the target document. Furthermore, the summarization problem can be specified as single-document, i.e., produce a summary for an independent document, or multi-document, i.e., produce a summary to represent a set of documents which usually cover a similar topic.

For this work, we focus on unsupervised methods. On the one hand, unsupervised methods are very enticing for meeting summarization as they do not depend on extensive manually annotated in-domain training data. They can thus be applied to any newly observed data without (or with only minor) prior adjustments. On the other hand, we only compare unsupervised systems, as it is rather unfair to compare unsupervised and supervised systems, which are usually applied under different circumstances. If there is enough training data available for the required application, a supervised system may be the method of choice as long as training and test data are from the same domain. If, however, training data is unavailable or sparse, or the test condition is unknown, unsupervised approaches should be considered. This is the case for our scenario, as we are interested in a system that can summarize any kind of meeting without prior adjustment or retraining. Nonetheless, to give an idea of the performance of supervised systems, we include experiments with a classification baseline.

Some of the methods presented in this work are rooted in multi-document summarization. We do not use them for their ability to tackle the redundancy naturally occurring in a set of documents on the same topic, but rather to promote diversity in the generated summaries so that even minor topics discussed in a meeting are represented. Diversity is less of an issue in the supervised setup because sentences are represented according to a variety of orthogonal features (position, length, speaker role, cue words, …) which each can lead to relevance. In the unsupervised setup, sentences with the same topical words get similar relevance assessments even if they are pronounced in very different contexts.

The most widely known algorithm for unsupervised summarization is maximum marginal relevance (MMR; Carbonell and Goldstein, 1998). This algorithm iteratively selects the sentence that is most relevant and least redundant to the previously selected ones. The greedy process can thus result in a suboptimal set of sentences as a better selection might be obtained by not choosing the most relevant sentence in the first place.
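The greedy loop can be sketched as follows. The cosine similarity over bag-of-words vectors, the trade-off weight lambda, and the token-level representation are illustrative assumptions, not the exact configuration used in this work:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    # cosine similarity between two bag-of-words Counters
    num = sum(a[w] * b[w] for w in a)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def greedy_mmr(sentences, query, lam=0.7, k=2):
    # sentences, query: lists of tokens; lam trades relevance against redundancy
    bows = [Counter(s) for s in sentences]
    qbow = Counter(query)
    selected = []
    while len(selected) < k:
        best, best_score = None, float("-inf")
        for i, bow in enumerate(bows):
            if i in selected:
                continue
            relevance = cosine(bow, qbow)
            # redundancy: highest similarity to any already-selected sentence
            redundancy = max((cosine(bow, bows[j]) for j in selected), default=0.0)
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

Note that the second pick already depends on the first: a near-duplicate of the top-ranked sentence is penalized even if it is highly relevant, which is exactly the property the greedy process relies on.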

In this article, we are interested in inference models that seek a global selection of sentences according to relevance and redundancy criteria. Our contributions are as follows:

  • We compare two approaches for global modeling in summarization: sentence-based scoring of relevance and redundancy, and sub-sentence based scoring with implicit redundancy.

    • For sentence-based scoring, we first propose a global formulation of MMR as an integer linear program (ILP). Such a formulation had not been proposed before because of non-linearities in MMR. Then, we compare this formulation to the similar model of McDonald (2007), which relaxes the non-linearities into a linear function.

    • We outline a different approach to summarization which does not rely on sentence level assessment of redundancy and relevance. Instead, the quality of the summary is determined by the number of important concepts (sub-sentence units) covered. A selection of sentences satisfying this criterion is found again by solving an ILP. This approach is based on the ICSI text summarization system (Gillick et al., 2008, Gillick and Favre, 2009) and was modified for the meeting domain in (Gillick et al., 2009).

  • While most MMR implementations rely on words and their frequency throughout the data, we have already shown that using keyphrases instead of words to model relevance leads to better performance for meeting summarization (Riedhammer et al., 2008a). In addition, keyphrases are used as concepts in the sub-sentence scoring approach. For this work, we refine keyphrase extraction and explore the effects of pruning.

  • We compare the complexity of the presented approaches and observe that sentence level models are less scalable than the concept level one.

  • A comprehensive analysis of the summarization performance according to parameters, pruning and length constraints shows that the concept level model yields better properties than the others.
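To make the concept-level objective concrete, the following sketch scores utterance subsets by the total weight of the unique concepts they cover under a length budget. A brute-force search stands in for the ILP solver used in this work, and the weights and budget are toy values:

```python
from itertools import combinations

def best_coverage(utterances, weights, budget):
    """Pick the subset of utterances maximizing the total weight of the
    *unique* concepts covered, subject to a length budget (in words).
    utterances: list of (words, concepts) pairs; weights: concept -> score.
    Exhaustive search here stands in for an exact ILP solver."""
    best_set, best_score = (), 0.0
    for r in range(len(utterances) + 1):
        for subset in combinations(range(len(utterances)), r):
            length = sum(len(utterances[i][0]) for i in subset)
            if length > budget:
                continue
            # a concept counts once no matter how often it is repeated,
            # which penalizes redundancy implicitly
            covered = set().union(*(utterances[i][1] for i in subset))
            score = sum(weights[c] for c in covered)
            if score > best_score:
                best_set, best_score = subset, score
    return list(best_set), best_score
```

Selecting a second utterance that repeats an already-covered concept adds length but no score, so redundant utterances are crowded out without any explicit pairwise redundancy term.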

Throughout this work, we use what we call “keyphrases”. Instead of extracting individual important words commonly known as “keywords”, we extract frequent noun phrases that match a certain pattern of determiners, adjectives and nouns.
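One plausible reading of such a pattern over part-of-speech tags is the following sketch; the specific regular expression and tag mapping are assumptions for illustration, not the exact pattern used in this work:

```python
import re

def extract_keyphrases(tagged):
    """Extract candidate keyphrases: an optional determiner, any number of
    adjectives, then one or more nouns. `tagged` is a list of (word, POS)
    pairs using Penn Treebank tags."""
    # Encode the tag sequence as a string so a regex can locate the pattern.
    code = {"DT": "d", "JJ": "a", "NN": "n", "NNS": "n"}
    tags = "".join(code.get(t, "x") for _, t in tagged)
    phrases = []
    for m in re.finditer(r"d?a*n+", tags):
        words = [w for w, _ in tagged[m.start():m.end()]]
        phrases.append(" ".join(words))
    return phrases
```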

This article is structured as follows. We begin with an overview of the related work in Section 2. In Section 4, we describe the two types of summarization models used for this work: sentence and concept based. For sentence-based summarization, we introduce a global formulation for the greedy MMR algorithm as an ILP and discuss how it relates to the formulation of McDonald (2007). For concept-based summarization, we present a model that gives credit to the presence of relevant keyphrases in the summary but penalizes them when they occur multiple times, and discuss differences to similar approaches found in Filatova and Hatzivassiloglou (2004) and Takamura and Okumura (2009). We conclude the model section with a description of how to extract the keyphrases which are the basis for both models. In Section 6, we describe the experiments we conducted to analyze the performance of the different approaches under fixed and varying constraints, compare greedy to optimal utterance selection and discuss two example summaries. We conclude with a discussion of the scalability of the methods and their flexibility towards practical use and, in a second step, abstractive summarization.

Related work

Speech summarization originated from the porting of methods developed for text summarization. It has been applied to various genres: broadcast news (Hori et al., 2002, Christensen et al., 2004, Zhang and Fung, 2007, Inoue et al., 2004, Maskey and Hirschberg, 2005, Mrozinski et al., 2005), lectures (Mrozinski et al., 2005, Furui et al., 2004), telephone dialogs (Zechner, 2002, Zhu and Penn, 2006) and meeting conversations (Murray et al., 2005a, Liu and Xie, 2008, Riedhammer et al., 2008b). Each

Data

For the experiments described in this work, we used manual and ASR transcripts of the ICSI (Janin et al., 2003) and AMI (McCowan et al., 2005) meeting corpora.

The AMI meeting corpus consists of both scenario (i.e., the topic is given) and non-scenario meetings. For this work, we use a subset of 137 scenario meetings in which four participants play different roles in an imaginary company. They talk about the design and realization of a new kind of remote control. Though the topic was given,

Summarization models

In this section, we detail two models for extractive summarization based on sentence level and concept level scoring. For each of them, we present exact global inference algorithms in form of an ILP which are then solved using the open source ILP solver glpsol from the GNU Linear Programming Kit.
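As an illustration of this setup, a small binary ILP can be written in the CPLEX LP text format that glpsol accepts (e.g. `glpsol --lp model.lp -o solution.txt`). The toy objective and single length constraint below are illustrative and do not reproduce the models of this section:

```python
def write_lp(path, weights, lengths, budget):
    """Write a toy binary ILP in CPLEX LP format: maximize the summed
    weights of the chosen sentences subject to a length budget.
    Variable x_i = 1 means sentence i enters the summary."""
    n = len(weights)
    obj = " + ".join(f"{weights[i]} x{i}" for i in range(n))
    cons = " + ".join(f"{lengths[i]} x{i}" for i in range(n))
    lines = [
        "Maximize",
        f" obj: {obj}",
        "Subject To",
        f" len: {cons} <= {budget}",
        "Binary",
        " " + " ".join(f"x{i}" for i in range(n)),
        "End",
    ]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
```

The full models add one variable per sentence (and, for the concept model, per concept) in the same way, which is what makes an off-the-shelf solver applicable.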

Relevance, redundancy and concepts

Though the previous section provides theoretical models required to build the summarization systems, the question of how to measure relevance and redundancy and how to find the concepts remains open. In text summarization, relevance is usually defined by a (user generated) query. The relevance score of a candidate sentence is then determined by an overlap measure with that query; redundancy is modeled in a similar way. If no query is provided, an artificial query is generated to represent the

Experiments

From the theory described in Sections 4 (Summarization models) and 5 (Relevance, redundancy and concepts), we build several summarizers to analyze and compare the performance of utterance- and concept-based systems:

  • mmr/greedy: The original iterative (greedy) MMR using keyphrase similarity as relevance and word overlap as redundancy measure.

  • mmr/ilp: The proposed ILP for a global formulation of MMR using the same relevance and redundancy scores as above.

  • mcd/ilp: McDonald’s ILP formulation for global

Conclusion and outlook

In this article, we provided an extensive comparison of global sentence- and concept-based models for meeting summarization. The former give relevance and redundancy scores to each sentence selected for a summary, while the latter assess the relevance of sub-sentence units (called concepts) contained in a summary without explicitly modeling redundancy. In our experiments, concept-based models yield the best results both in terms of summary quality and in terms of run time.

Though (greedy) sentence-based

References (49)

  • C. Burges

    A tutorial on support vector machines for pattern recognition

    Data Mining Knowl. Discovery

    (1998)
  • Carbonell, J., Goldstein, J., 1998. The use of MMR, diversity-based reranking for reordering documents and producing...
  • H. Christensen et al.

    From text summarisation to style-specific summarisation for broadcast news

    Lect. Notes Comput. Sci.

    (2004)
  • Filatova, E., Hatzivassiloglou, V., 2004. Event-based extractive summarization. In: Proc. ACL Workshop on...
  • S. Furui et al.

    Speech-to-text and speech-to-speech summarization of spontaneous speech

    IEEE Trans. Speech Audio Process.

    (2004)
  • Garg, N., Favre, B., Riedhammer, K., Hakkani-Tür, D., 2009. ClusterRank: a graph based method for meeting...
  • Gillick, D., Favre, B., 2009. A scalable global model for summarization. In: Proc. ACL-HLT Workshop on Integer Linear...
  • Gillick, D., Favre, B., Hakkani-Tür, D., 2008. The ICSI Summarization System at TAC’08. In: Proc. of the Text Analysis...
  • Gillick, D., Riedhammer, K., Favre, B., Hakkani-Tür, D., 2009. A global optimization framework for meeting...
  • Ha, L., Sicilia-Garcia, E., Ming, J., Smith, F., 2002. Extension of Zipf’s law to words and phrases. In: Proc....
  • Hori, C., Furui, S., 2000. Improvements in automatic speech summarization and evaluation methods. In: Proc. Internat....
  • Hori, C., Furui, S., Malkin, R., Yu, H., Waibel, A., 2002. Automatic speech summarization applied to English broadcast...
  • Hovy, E., Lin, C., Zhou, L., Fukumoto, J., 2006. Automated summarization evaluation with basic elements. In: Proc....
  • Huang, Z., Harper, M., Wang, W., 2007. Mandarin part-of-speech tagging and discriminative reranking. In: Proc....
  • Inoue, A., Mikami, T., Yamashita, Y., 2004. Improvement of speech summarization using prosodic information. In: Proc....
  • Janin, A., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Morgan, N., Peskin, B., Pfau, T., Shriberg, E., Stolcke, A.,...
  • Lin, C., 2004. ROUGE: a package for automatic evaluation of summaries. In: Proc. Workshop on Text Summarization...
  • Lin, H., Bilmes, J., Xie, S., 2009. Graph-based submodular selection for extractive summarization. In: Proc. IEEE...
  • Liu, F., Liu, Y., 2008. Correlation between ROUGE and human evaluation of extractive meeting summaries. In: Proc....
  • Liu, F., Liu, Y., 2009. From extractive to abstractive meeting summaries: can it be done by sentence compression? In:...
  • Liu, Y., Xie, S., 2008. Impact of automatic sentence segmentation on meeting summarization. In: Proc. IEEE Internat....
  • Liu, F., Liu, F., Liu, Y., 2008. Automatic keyword extraction for the meeting corpus using supervised approach and...
  • Liu, F., Pennell, D., Liu, F., Liu, Y., 2009. Unsupervised approaches for automatic keyword extraction using meeting...
  • Maskey, S., Hirschberg, J., 2005. Comparing lexical, acoustic/prosodic, structural and discourse features for speech...