Two uses of anaphora resolution in summarization
Introduction
Information about anaphoric relations could be beneficial for applications such as summarization and segmentation that involve extracting (possibly very simplified) discourse models from text. In this work we investigated exploiting automatically extracted information about the anaphoric relations in a text for two different aspects of the summarization task. First of all, we used anaphoric information to enrich the latent semantic representation of a document, from which a summary is then extracted. Secondly, we used anaphoric information to check that the anaphoric expressions contained in the summary thus extracted still have the same interpretation that they had in the original text.
Our approach to summarization follows what has been called a term-based approach (Hovy & Lin, 1997). In term-based summarization, the most important information in a document is found by identifying its main ‘terms’ (also sometimes called ‘topics’), and then extracting from the document the most important information about these terms. Such approaches are usually classified as ‘lexical’ approaches or ‘coreference- (or anaphora-) based’ approaches. Lexical approaches to summarization use word similarity and other lexical relations to identify central terms (Barzilay & Elhadad, 1997); we would include among these previous approaches based on LSA (Landauer & Dumais, 1997), such as Gong and Liu (2002) and Steinberger and Jezek (2004). Coreference- or anaphora-based approaches (Baldwin & Morton, 1998; Bergler et al., 2003; Boguraev & Kennedy, 1999; Stuckardt, 2003) identify these terms by running a coreference or anaphora resolver over the text. We are not aware, however, of any attempt to use both lexical and anaphoric information to identify the main terms, other than our own previous work (Steinberger, Kabadjov, & Poesio, 2005).
In this paper, we present a new LSA-based sentence extraction summarizer which uses both lexical and anaphoric information. We found in previous work that feeding (automatically extracted) information about anaphoric relations to a summarizer can improve its performance (Steinberger et al., 2005). In the work discussed here, however, we improve our previous methods in several respects. First of all, we improved our methods for extracting a summary from a document in two ways. We developed an improved version of our anaphoric resolver, guitar, which can now also identify coreference relations between proper nouns; these are often used to indicate key terms. We also improved our method for extracting summaries from an LSA-style representation by developing a new approach for choosing the dimensionality reduction on the basis of the required (percentage) size of the summary. The new system was evaluated on the standard reference corpus from DUC-2002, making it possible to compare its performance not only with that of two LSA-based summarizers using only lexical information, but also with that of the other systems participating in DUC-2002, using the standard ROUGE evaluation measure.
In addition, we propose here a method for using anaphoric information to check the entity-coherence of a summary once it has been extracted. Summarization by sentence extraction may produce summaries with ‘dangling’ anaphoric expressions – expressions whose antecedent has not been included in the summary, and which therefore cannot be interpreted or are interpreted incorrectly. Our algorithm checks that the interpretation of anaphoric expressions in a summary is consistent with their interpretation in the original text. The algorithm can be used irrespective of whether the summary was produced using our own methods or other methods.
The structure of the paper is as follows. Section 2 presents some background information: we discuss our previous work using pure Latent Semantic Analysis (LSA) for summarization and present two LSA-based summarizers that only use lexical information to identify the main topics of a document. We then make the case for using anaphoric information as well as lexical information, and introduce our anaphora resolution system (Section 3). Then, in Section 4, we discuss our methods for using anaphoric information in LSA, and their evaluation using the DUC-2002 corpus. In Section 5 we present the last step in our summarization approach, the summary reference checker. Finally, we discuss our vision of applying the presented ideas to multi-document summarization.
Previous work
LSA (Landauer & Dumais, 1997) is a technique for extracting the ‘hidden’ dimensions of the semantic representation of terms, sentences, or documents, on the basis of their use. It has been extensively used in educational applications such as essay ranking (Landauer & Dumais, 1997), as well as in NLP applications including information retrieval (Berry, Dumais, & O’Brien, 1995) and text segmentation (Choi, Wiemer-Hastings, & Moore, 2001).
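As a rough illustration of what extracting a latent dimension involves, the sketch below computes the dominant singular triplet of a small term-by-sentence count matrix by power iteration. The matrix, its row and column interpretation, and the function names are invented for this example and are not taken from the systems discussed here; a practical implementation would more likely call a library SVD routine and keep several dimensions rather than one.

```python
import math

def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    """Return the transpose of A as a list of rows."""
    return [list(col) for col in zip(*A)]

def dominant_singular_triplet(A, iterations=200):
    """Return (sigma, u, v) for the largest singular value of A.

    Power iteration on A^T A. Starting from the all-ones vector is safe
    for non-negative, non-zero count matrices, because their dominant
    singular vector has no negative entries.
    """
    At = transpose(A)
    v = [1.0] * len(A[0])
    for _ in range(iterations):
        w = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("matrix is all zeros")
        v = [x / norm for x in w]
    Av = matvec(A, v)
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]
    return sigma, u, v

# Toy matrix (assumed data): rows are terms, columns are sentences,
# and each cell counts how often the term occurs in the sentence.
A = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]

sigma, u, v = dominant_singular_triplet(A)
```

In this sketch the entries of `v` indicate how strongly each sentence loads on the dominant latent dimension, and the entries of `u` do the same for terms.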
More recently, a method for using LSA for multi- and
Motivations
Boguraev and Kennedy (1999) use the following news article to illustrate why being able to recognize anaphoric chains may help in identifying the main topics of a document.
A Spanish priest was charged here today with attempting to murder the Pope. Juan Fernandez Krohn, aged 32, was arrested after a man armed with a bayonet approached the Pope while he was saying prayers at Fatima on Wednesday night. According to the police, Fernandez told the investigators today
Combining lexical information and anaphoric information to build the LSA representation
‘Purely lexical’ LSA determines the main ‘topics’ of a document on the basis of the simplest possible notion of term, simple words, as usual in LSA. In this section we will see, however, that anaphoric information can be easily integrated in a mixed lexical/anaphoric LSA representation by generalizing the notion of ‘term’ used in SVD matrices to include discourse entities as well, and counting a discourse entity d as occurring in sentence s whenever the anaphoric resolver identifies a noun
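The generalized notion of ‘term’ can be sketched as follows. The code builds a count matrix whose rows are either words or discourse entities, and adds one to an entity row for each mention of that entity reported in a sentence, which is one plausible reading of the counting rule above. The input format, the entity ids `E1` and `E2`, and the function name `build_mixed_matrix` are assumptions made for this illustration, and the mention lists are assigned by hand rather than produced by an actual resolver.

```python
from collections import Counter

def build_mixed_matrix(sentences, mentions):
    """Build a mixed lexical/anaphoric count matrix.

    sentences: list of token lists, one per sentence.
    mentions:  list of entity-id lists, one per sentence
               (hypothetical resolver output).
    Returns (rows, matrix), where rows[i] is ("word", w) or ("entity", d)
    and matrix[i][j] is the count of rows[i] in sentence j.
    """
    counts = []
    for tokens, entities in zip(sentences, mentions):
        c = Counter(("word", t.lower()) for t in tokens)
        c.update(("entity", e) for e in entities)
        counts.append(c)
    rows = sorted({key for c in counts for key in c})
    matrix = [[c[key] for c in counts] for key in rows]
    return rows, matrix

# Toy input (assumed data): two sentences that mention the same two
# entities with different words. E1 = the priest, E2 = the Pope.
sentences = [
    ["priest", "charged", "murder", "pope"],
    ["krohn", "arrested", "pope"],
]
mentions = [
    ["E1", "E2"],
    ["E1", "E2"],
]

rows, matrix = build_mixed_matrix(sentences, mentions)
```

The word rows for `priest` and `krohn` each touch only one sentence, whereas the entity row for `E1` touches both; this is the extra evidence about shared topics that a purely lexical matrix does not carry.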
A summary (entity) coherence checker
Anaphoric expressions can only be understood with respect to a context. This means that summarization by sentence extraction can wreak havoc with their interpretation: there is no guarantee that they will have an interpretation in the context obtained by extracting sentences to form a summary, or that this interpretation will be the same as in the original text. Consider the following example.
- (S1)
[Prime Minister Margaret Thatcher]1 said Monday [[the
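The kind of check motivated in this section can be sketched in simplified form. The code below flags an anaphoric expression in a summary when no earlier mention of the same entity, as resolved on the original text, survives in the extracted sentences. The `Mention` record, the function name `dangling_anaphors`, and the toy data are assumptions introduced for illustration; this is a simplification rather than the checker described in the paper, and it only detects missing antecedents, not interpretations that have changed.

```python
from typing import NamedTuple

class Mention(NamedTuple):
    sentence: int    # index of the sentence in the original text
    offset: int      # position of the mention inside that sentence
    entity: str      # entity id assigned by resolving the original text
    anaphoric: bool  # True for expressions that need an antecedent
    text: str

def dangling_anaphors(mentions, summary_sentences):
    """Return anaphoric mentions in the summary that lack an earlier
    mention of the same entity among the extracted sentences."""
    kept = set(summary_sentences)
    seen = set()
    dangling = []
    for m in sorted(mentions, key=lambda m: (m.sentence, m.offset)):
        if m.sentence not in kept:
            continue
        if m.anaphoric and m.entity not in seen:
            dangling.append(m)
        # Later anaphors for this entity are not reported again.
        seen.add(m.entity)
    return dangling

# Toy data (invented, not the example from the text).
mentions = [
    Mention(0, 0, "E1", False, "the minister"),
    Mention(1, 0, "E2", False, "the proposal"),
    Mention(2, 0, "E1", True, "she"),
    Mention(2, 3, "E2", True, "it"),
]

only_last = dangling_anaphors(mentions, [2])
with_first = dangling_anaphors(mentions, [0, 2])
```

A summary made of the last sentence alone leaves both pronouns without antecedents, while adding the first sentence anchors `she` and leaves only `it` flagged.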
Future work: multi-document summarization
We are currently working to apply the methods proposed here to multi-document summarization.
The single-document LSA approach can be easily extended to process multiple documents by including all sentences in a cluster of documents in the SVD input matrix. The latent space would then be reduced to r dimensions according to the dimensionality reduction approach currently used (see Section 2.3). The sentence selection approach can be used as well; however, care has to be taken to avoid
Conclusion and further research
In this paper we presented three main contributions. First of all, we developed a method for using both anaphoric and lexical information to build an LSA representation which works well for summarization and, we believe, can also be used in other applications – e.g., in our work on text segmentation (Kabadjov, 2007). This method involves two novel ideas. First of all, we improved the Gong and Liu method for extracting a summary from an LSA representation, by providing a new method for
Acknowledgement
This research was partly supported by project 2C06009 (COT-SEWing).
References (28)
- Baldwin, B., & Morton, T. S. (1998). Dynamic coreference-based summarization. In Proceedings of EMNLP, Granada,...
- Barzilay, R., & Elhadad, M. (1997). Using lexical chains for text summarization. In Proceedings of the ACL/EACL...
- Bergler, S., Witte, R., Khalife, M., Li, Z., & Rudzicz, F. (2003). Using knowledge-poor coreference resolution for text...
- Berry, Dumais, & O’Brien (1995). Using linear algebra for intelligent IR. SIAM Review.
- Boguraev & Kennedy (1999). Salience-based content characterization of text documents.
- Bontcheva, K., Dimitrov, M., Maynard, D., Tablan, V., & Cunningham, H. (2002). Shallow methods for named entity...
- Charniak, E. (2000). A maximum-entropy-inspired parser. In Proceedings of NAACL, Philadelphia,...
- Choi, F. Y. Y., Wiemer-Hastings, P., & Moore, J. D. (2001). Latent semantic analysis for text segmentation. In...
- A probabilistic model for latent semantic indexing. Journal of the American Society for Information Science and Technology (2005).
- Gong, Y., & Liu, X. (2002). Generic text summarization using relevance measure and latent semantic analysis. In...