Two uses of anaphora resolution in summarization

https://doi.org/10.1016/j.ipm.2007.01.010

Abstract

We propose a new method for using anaphoric information in Latent Semantic Analysis (lsa), and discuss its application to develop an lsa-based summarizer which achieves a significantly better performance than a system not using anaphoric information, and a better performance by the rouge measure than all but one of the single-document summarizers participating in DUC-2002. Anaphoric information is automatically extracted using a new release of our own anaphora resolution system, guitar, which incorporates proper noun resolution. Our summarizer also includes a new approach for automatically identifying the dimensionality reduction of a document on the basis of the desired summarization percentage. Anaphoric information is also used to check the coherence of the summary produced by our summarizer, by a reference checker module which identifies anaphoric resolution errors caused by sentence extraction.

Introduction

Information about anaphoric relations could be beneficial for applications such as summarization and segmentation that involve extracting (possibly very simplified) discourse models from text. In this work we investigated exploiting automatically extracted information about the anaphoric relations in a text for two different aspects of the summarization task. First of all, we used anaphoric information to enrich the latent semantic representation of a document, from which a summary is then extracted. Secondly, we used anaphoric information to check that the anaphoric expressions contained in the summary thus extracted still have the same interpretation that they had in the original text.

Our approach to summarization follows what has been called a term-based approach (Hovy & Lin, 1997). In term-based summarization, the most important information in a document is found by identifying its main ‘terms’ (also sometimes called ‘topics’), and then extracting from the document the most important information about these terms. Such approaches are usually classified as ‘lexical’ approaches or ‘coreference- (or anaphora-) based’ approaches. Lexical approaches to summarization use word similarity and other lexical relations to identify central terms (Barzilay & Elhadad, 1997); we would include among these previous approaches based on lsa (Landauer & Dumais, 1997), such as Gong and Liu (2002) and Steinberger and Jezek (2004). Coreference- or anaphora-based approaches (Baldwin & Morton, 1998; Bergler et al., 2003; Boguraev & Kennedy, 1999; Stuckardt, 2003) identify these terms by running a coreference or anaphora resolver over the text. We are not aware, however, of any attempt to use both lexical and anaphoric information to identify the main terms, other than our own previous work (Steinberger, Kabadjov, & Poesio, 2005).

In this paper, we present a new lsa-based sentence extraction summarizer which uses both lexical and anaphoric information. We already found in previous work that feeding (automatically extracted) information about anaphoric relations to a summarizer can improve its performance (Steinberger et al., 2005). In the work discussed here, however, we improve our previous methods in several respects. To begin with, we improved our methods for extracting a summary from a document in two ways. First, we developed an improved version of our anaphoric resolver, guitar, so that it can now also identify coreference relations between proper nouns, which are often used to indicate key terms. Second, we improved our method for extracting summaries from an lsa-style representation by developing a new approach for finding the dimensionality reduction on the basis of the required (percentage) size of the summary. The new system was evaluated on the standard reference corpus from DUC-2002, making it possible to compare its performance not only with that of two lsa-based summarizers using only lexical information, but also with that of the other systems participating in DUC-2002, using the standard rouge evaluation measure.

In addition, we also propose here a method for using anaphoric information to check the entity-coherence of a summary once this has been extracted. Summarization by sentence extraction may produce summaries with ‘dangling’ anaphoric expressions – expressions whose antecedent has not been included in the summary, and therefore cannot be interpreted or are interpreted incorrectly. Our algorithm checks that the interpretation of anaphoric expressions in a summary is consistent with their interpretation in the original text. The algorithm can be used irrespective of whether the summary was produced using our own methods or other methods.
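The kind of check described above can be illustrated with a deliberately simplified sketch. It assumes that a resolver's output on the original text is available as a list of mentions in textual order, each carrying a sentence index, an entity identifier and a flag saying whether the mention is anaphoric. The `Mention` fields and the function `find_dangling_anaphors` are hypothetical names introduced for this example, and the test applied (has a non-anaphoric mention of the same entity been kept earlier in the summary?) is only a simple proxy for consistency of interpretation, not the algorithm proposed in the paper.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    sentence: int    # index of the sentence in the original text
    entity: str      # entity id assigned by a resolver (assumed input)
    text: str        # surface form of the mention
    anaphoric: bool  # True for pronouns, definite descriptions, etc.


def find_dangling_anaphors(mentions, selected):
    """Return anaphoric mentions kept in the summary whose entity has not
    been introduced by an earlier non-anaphoric mention in the summary."""
    selected = set(selected)
    # Keep only mentions from extracted sentences; sorted() is stable, so
    # the textual order inside each sentence is preserved.
    kept = sorted(
        (m for m in mentions if m.sentence in selected),
        key=lambda m: m.sentence,
    )
    introduced = set()  # entities already introduced by a full mention
    dangling = []
    for m in kept:
        if m.anaphoric:
            if m.entity not in introduced:
                dangling.append(m)
        else:
            introduced.add(m.entity)
    return dangling


# Toy input: the entity is introduced in sentence 0 and referred to in 2.
mentions = [
    Mention(0, "e1", "A Spanish priest", False),
    Mention(2, "e1", "he", True),
]
flagged = find_dangling_anaphors(mentions, selected=[2])
```

With only sentence 2 extracted, the pronoun is flagged because its antecedent was left out; extracting sentences 0 and 2 together flags nothing. A summarizer could react to a flagged mention by adding the antecedent sentence or by replacing the anaphor with a fuller expression, but that choice is outside this sketch.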

The structure of the paper is as follows. In Section 2, some background information is presented. Our previous work using pure Latent Semantic Analysis (lsa) for summarization is discussed, and we present two lsa-based summarizers that only use lexical information to identify the main topics of a document. We then make the case for using anaphoric information as well as lexical information, and introduce our anaphora resolution system (Section 3). Then, in Section 4, we discuss our methods for using anaphoric information in lsa, and their evaluation using the DUC-2002 corpus. In Section 5 we present the last step in our summarization approach, the summary reference checker. Finally, we discuss our vision of applying the presented ideas to multi-document summarization.

Previous work

lsa (Landauer & Dumais, 1997) is a technique for extracting the ‘hidden’ dimensions of the semantic representation of terms, sentences, or documents, on the basis of their use. It has been extensively used in educational applications such as essay ranking (Landauer & Dumais, 1997), as well as in nlp applications including information retrieval (Berry, Dumais, & O’Brien, 1995) and text segmentation (Choi, Wiemer-Hastings, & Moore, 2001).

More recently, a method for using lsa for multi- and

Motivations

Boguraev and Kennedy (1999) use the following news article to illustrate why being able to recognize anaphoric chains may help in identifying the main topics of a document.

PRIEST IS CHARGED WITH POPE ATTACK

A Spanish priest was charged here today with attempting to murder the Pope. Juan Fernandez Krohn, aged 32, was arrested after a man armed with a bayonet approached the Pope while he was saying prayers at Fatima on Wednesday night. According to the police, Fernandez told the investigators today

Combining lexical information and anaphoric information to build the LSA representation

‘Purely lexical’ lsa determines the main ‘topics’ of a document on the basis of the simplest possible notion of term, simple words, as usual in lsa. In this section we will see, however, that anaphoric information can be easily integrated in a mixed lexical/anaphoric lsa representation by generalizing the notion of ‘term’ used in svd matrices to include discourse entities as well, and counting a discourse entity d as occurring in sentence s whenever the anaphoric resolver identifies a noun
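The generalization of ‘term’ described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_term_sentence_matrix`, the whitespace tokenizer, the raw-count weighting and the `entity_mentions` input format (a hypothetical entity identifier mapped to the indices of the sentences in which a resolver reported a mention) are all introduced for the example.

```python
from collections import Counter


def build_term_sentence_matrix(sentences, entity_mentions):
    """Return (row_labels, matrix) for a mixed lexical/anaphoric matrix.

    sentences: list of sentence strings.
    entity_mentions: dict mapping an entity id to the list of sentence
        indices in which a resolver found a mention of that entity
        (assumed input format).
    """
    # Lexical rows: one row per distinct lower-cased word.
    tokenized = [s.lower().split() for s in sentences]
    counts = [Counter(tokens) for tokens in tokenized]
    words = sorted({w for tokens in tokenized for w in tokens})
    labels = [("word", w) for w in words]
    matrix = [[c[w] for c in counts] for w in words]

    # Entity rows: entity d counts as occurring in sentence s once for
    # every mention of d that the resolver reported in s.
    for entity_id in sorted(entity_mentions):
        per_sentence = Counter(entity_mentions[entity_id])
        labels.append(("entity", entity_id))
        matrix.append([per_sentence[i] for i in range(len(sentences))])
    return labels, matrix


# Toy input: the words of the two sentences do not reveal that "priest"
# and "he" refer to the same individual, but the entity row "e1" does.
sentences = ["the priest was charged", "he was arrested"]
labels, matrix = build_term_sentence_matrix(sentences, {"e1": [0, 1]})
```

In this toy input the rows for "priest" and "he" each cover a single sentence, while the added entity row links both sentences. The resulting matrix would then be decomposed by svd in the usual lsa fashion; that step is omitted here.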

A summary (entity) coherence checker

Anaphoric expressions can only be understood with respect to a context. This means that summarization by sentence extraction can wreak havoc with their interpretation: there is no guarantee that they will have an interpretation in the context obtained by extracting sentences to form a summary, or that this interpretation will be the same as in the original text. Consider the following example.

PRIME MINISTER CONDEMNS IRA FOR MUSIC SCHOOL EXPLOSION

  • (S1)

    [Prime Minister Margaret Thatcher]1 said Monday [[the

Future work: multi-document summarization

We are currently working to apply the methods proposed here to multi-document summarization.

The single-document lsa approach can be easily extended to process multiple documents by including all sentences in a cluster of documents in the svd input matrix. The latent space would then be reduced to r dimensions according to the dimensionality reduction approach currently used (see Section 2.3). The sentence selection approach can be used as well; however, care has to be taken to avoid

Conclusion and further research

In this paper we presented three main contributions. First of all, we developed a method for using both anaphoric and lexical information to build an lsa representation which works well for summarization and, we believe, can also be used in other applications – e.g., in our work on text segmentation (Kabadjov, 2007). This method involves two novel ideas. First, we improved the Gong and Liu method for extracting a summary from an lsa representation, by providing a new method for

Acknowledgement

This research was partly supported by project 2C06009 (COT-SEWing).

References (28)

  • Baldwin, B., & Morton, T. S. (1998). Dynamic coreference-based summarization. In Proceedings of EMNLP, Granada,...
  • Barzilay, R., & Elhadad, M. (1997). Using lexical chains for text summarization. In Proceedings of the ACL/EACL...
  • Bergler, S., Witte, R., Khalife, M., Li, Z., & Rudzicz, F. (2003). Using knowledge-poor coreference resolution for text...
  • Berry, M. W., et al. (1995). Using linear algebra for intelligent IR. SIAM Review.
  • Boguraev, B., et al. Salience-based content characterization of text documents.
  • Bontcheva, K., Dimitrov, M., Maynard, D., Tablan, V., & Cunningham, H. (2002). Shallow methods for named entity...
  • Charniak, E. (2000). A maximum-entropy-inspired parser. In Proceedings of NAACL, Philadelphia,...
  • Choi, F. Y. Y., Wiemer-Hastings, P., & Moore, J. D. (2001) Latent semantic analysis for text segmentation. In...
  • Ding, Ch. H. Q. (2005). A probabilistic model for latent semantic indexing. Journal of the American Society for Information Science and Technology.
  • Gong, Y., & Liu, X. (2002). Generic text summarization using relevance measure and latent semantic analysis. In...
  • Hasler, L., et al. Building better corpora for summarization.
  • Hovy, E., & Lin, C. (1997). Automated text summarization in summarist. In ACL/EACL workshop on intelligent scalable...
  • Kabadjov, M. A. (2007). Anaphora resolution and applications. PhD Dissertation, University of Essex,...
  • Kabadjov, M. A., Poesio, M. & Steinberger, J. (2005). Task-based evaluation of anaphora resolution: the case of...