
About this Book

This book describes recent advances in text summarization, identifies remaining gaps and challenges, and proposes ways to overcome them. It begins with one of the most frequently discussed topics in text summarization, sentence extraction, examines the effectiveness of current techniques in domain-specific text summarization, and proposes several improvements. The book then turns to the application of summarization in the legal and scientific domains, introducing two new corpora that consist of more than 100,000 court judgments and more than 20,000 scientific articles, together with the corresponding manually written summaries. The availability of these large-scale corpora opens up the possibility of using the now popular data-driven approaches based on deep learning. The book then highlights the effectiveness of neural sentence extraction approaches, which perform as well as rule-based approaches but without the need for any manual annotation. As a next step, it proposes multiple techniques for creating ensembles of sentence extractors, which deliver better and more robust summaries. In closing, the book presents a neural network-based model for sentence compression. Overall, the book takes readers on a journey that begins with simple sentence extraction and ends in abstractive summarization, while also covering key topics such as ensemble techniques and domain-specific summarization, which had not been explored in detail before.

Table of Contents

Frontmatter

Chapter 1. Introduction

Abstract
Automatic summarisation, or reducing a text document while retaining its most essential points, is not a new research area. The first notable attempt, which dates back to 1958, was made by [14]. It uses word frequencies to identify significant words in a given sentence. The importance of a sentence is then determined from the number of significant words it contains and the proximity of these words to each other. Since then, the techniques for both sentence selection (extractive summarisation) and abstract generation (abstractive summarisation) have advanced considerably. However, some aspects of text summarisation have not received much attention. This book intends to cover those aspects, such as domain-specific summarisation and ensemble-based techniques, in greater detail. In this chapter we provide a quick overview of the three major types of summarisation systems along with their pros and cons, a glimpse into the overall content of this book, and its principal contributions.
Parth Mehta, Prasenjit Majumder
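The frequency-and-proximity idea attributed to [14] (Luhn's 1958 method) can be illustrated with a minimal sketch. The tokenisation, the choice of top-k significant words, and the clustering window below are simplifying assumptions of ours, not details from the book; a real implementation would also remove stopwords and stem.

```python
from collections import Counter
import re

def luhn_score(sentence, significant, window=4):
    """Score a sentence by the densest cluster of significant words:
    (number of significant words in cluster)^2 / cluster span."""
    words = sentence.lower().split()
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    best, start = 0.0, 0
    for i in range(1, len(positions) + 1):
        # Close a cluster when the next significant word is too far away.
        if i == len(positions) or positions[i] - positions[i - 1] > window:
            cluster = positions[start:i]
            span = cluster[-1] - cluster[0] + 1
            best = max(best, len(cluster) ** 2 / span)
            start = i
    return best

def summarise(text, n=1, top_k=5):
    """Extract the n highest-scoring sentences from a text."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    freq = Counter(w for s in sentences for w in s.lower().split())
    significant = {w for w, _ in freq.most_common(top_k)}
    ranked = sorted(sentences, key=lambda s: luhn_score(s, significant), reverse=True)
    return ranked[:n]
```

For example, with `significant = {"cat"}`, the sentence "the cat and the cat" contains one cluster of two significant words spanning four tokens, giving a score of 2² / 4 = 1.0.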

Chapter 2. Related Work

Abstract
In this chapter, we examine some of the existing techniques for sentence extraction and sentence compression. We also discuss the existing approaches for creating ensembles of these systems, point out specific cases where the existing techniques would not work, and consider what needs to be done to handle such cases. We discuss the various existing approaches for extractive summarisation so as to maximise the representation of several different categories of extractive techniques. In contrast to the extractive techniques, ensemble techniques for summarisation are far fewer in number, but warrant a detailed discussion in order to present our arguments; we discuss all the existing ensemble approaches and their strengths and weaknesses. For abstractive summarisation and sentence compression we classify the approaches into two major groups: those dependent on linguistic resources and those that are completely data-driven. We end the chapter with an overview of domain-specific summarisation approaches, specifically those related to legal and scientific articles.
Parth Mehta, Prasenjit Majumder

Chapter 3. Corpora and Evaluation for Text Summarisation

Abstract
A standard benchmark collection is essential to the reproducibility of any research. Several early works in text summarisation suffered from the lack of standard evaluation corpora at the time [1, 8]. The advent of venues like the Document Understanding Conference (DUC) [2] and the Text Analysis Conference (TAC) [18] solved that problem. These conferences generated standard evaluation benchmarks for text summarisation and thereby made streamlined efforts possible. Today such benchmark collections of documents and related manually written summaries, provided by DUC and TAC, are by far the most widely used collections for text summarisation. They have become essential for reproducibility as well as for comparing cross-system performance. However, with many data-driven approaches being suggested in the last few years, the DUC and TAC collections, with their hundreds of article-summary pairs, are no longer sufficient. There are a few other corpora, like the Gigaword corpus and the CNN/Daily Mail corpus [21], which have millions of document-summary pairs. But these corpora are not publicly available and hence are of limited use. Moreover, both these corpora, as well as DUC and TAC, consist only of newswire text. TAC did later introduce a task on biomedical article summarisation, which we discuss later in this chapter. But overall there are few domain-specific corpora that are both substantially large, to benefit the data-driven approaches, and publicly available. In this work we propose two new corpora for domain-specific summarisation in the legal and scientific domains. The legal corpus consists of judgements delivered by the Supreme Court of India and the associated summaries written by legal experts. The corpus of scientific articles consists of research papers from the ACL Anthology, a publicly available repository of research papers from computational linguistics and related domains.
In this chapter we briefly discuss the DUC and TAC corpora as well as the corpora developed as a part of this work. We also provide an overview of the various strategies that are used to evaluate summarisation systems.
Parth Mehta, Prasenjit Majumder
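The most common automatic evaluation strategy for summarisation, ROUGE-N, compares n-gram overlap between a system summary and a reference. As a rough illustration only (real ROUGE applies stemming, stopword options, and reports precision/recall/F-measure), here is a minimal sketch of ROUGE-N recall:

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams recovered by the candidate."""
    def ngrams(text, size):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + size]) for i in range(len(toks) - size + 1))
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    if not ref:
        return 0.0
    # Clipped overlap: each reference n-gram can be matched at most as
    # many times as it occurs in the reference.
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / sum(ref.values())
```

For instance, the candidate "the cat sat" recovers 3 of the 6 unigram occurrences in the reference "the cat sat on the mat", so ROUGE-1 recall is 0.5.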

Chapter 4. Domain-Specific Summarisation

Abstract
Automatic text summarisation, especially sentence extraction, has received a great deal of attention from researchers. However, a majority of the work focuses on newswire summarisation, where the goal is to generate headlines or short summaries from a single news article or a cluster of related news articles. One primary reason for this is that most public datasets related to text summarisation consist of newswire articles. Whether it is the traditional Document Understanding Conference (DUC) or Text Analysis Conference (TAC) datasets or the recent CNN/Daily Mail corpus, the focus is mainly on newswire articles. In reality, this forms a rather small part of the numerous possible applications of text summarisation. The focus is now shifting towards other areas like product-review summarisation, domain-specific summarisation and real-time summarisation. Each of these areas has its own set of challenges, but they have one issue in common: the availability of large-scale corpora that can be used for supervised or semi-supervised learning. In this work, we highlight two such use cases, related to summarising legal and scientific articles, which are very different from generic document summarisation tasks. We discuss how these differ from generic newswire summarisation, introduce two new corpora for these domains, and propose new keyword-based as well as neural sentence extraction techniques.
Parth Mehta, Prasenjit Majumder
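A keyword-based sentence extractor of the kind mentioned above can be sketched in a few lines: score a document's words by tf-idf against a background collection, then rank sentences by how many keywords they contain. The tokenisation and scoring details below are our own simplifying assumptions, not the book's exact method.

```python
import math
import re
from collections import Counter

def tfidf_keywords(documents, doc_index, k=5):
    """Rank the words of one document by tf-idf against the whole collection."""
    tokenised = [d.lower().split() for d in documents]
    df = Counter()
    for toks in tokenised:
        df.update(set(toks))          # document frequency: one count per doc
    tf = Counter(tokenised[doc_index])
    n = len(documents)
    scores = {w: c * math.log(n / df[w]) for w, c in tf.items()}
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

def extract_by_keywords(document, keywords, n=1):
    """Return the n sentences containing the most keywords."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', document) if s.strip()]
    score = lambda s: sum(w in keywords for w in s.lower().split())
    return sorted(sentences, key=score, reverse=True)[:n]
```

In a legal setting, the background collection would be the full set of judgments, so domain-wide boilerplate scores low while case-specific terms score high.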

Chapter 5. Improving Sentence Extraction Through Rank Aggregation

Abstract
A plethora of extractive summarisation techniques have been developed in the past decade, but very few enquiries have been made into how they differ from each other or what factors affect their performance. Such a comparison, if available, can be used to create a robust ensemble of these approaches, which can consistently outperform each individual summarisation system. In this chapter we examine the roles of three principal components of an extractive summarisation technique: the sentence ranking algorithm, the sentence similarity metric and the text representation scheme. We show that using a combination of several different sentence similarity measures, rather than choosing any particular measure, significantly improves the performance of the resultant meta-system. Even simple ensemble techniques, when used in an informed manner, prove very effective in improving the overall performance and consistency of summarisation systems. When aggregating multiple ranking algorithms or text similarity measures, the improvement in ROUGE score is not always significant, but the resultant meta-systems are more robust than the candidate systems. The results suggest that, when proposing a sentence extraction technique, defining better sentence similarity metrics would be more impactful than a new ranking algorithm. Moreover, using multiple sentence similarity scores and ranking algorithms, rather than committing to any particular combination, consistently yields improved and robust performance.
Parth Mehta, Prasenjit Majumder
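One of the simplest ways to aggregate the rank lists produced by different extractors is a Borda count, where each list awards a sentence points inversely proportional to its rank. This is offered as a generic illustration of rank aggregation, not as the specific scheme evaluated in the chapter.

```python
def borda_aggregate(rank_lists):
    """Combine several sentence rank lists with Borda counts:
    a list of length m awards (m - position) points to each sentence.
    Returns sentences ordered by total points (ties broken alphabetically)."""
    scores = {}
    for ranking in rank_lists:
        m = len(ranking)
        for pos, sent in enumerate(ranking):
            scores[sent] = scores.get(sent, 0) + (m - pos)
    return sorted(scores, key=lambda s: (-scores[s], s))
```

For example, three extractors ranking sentences a, b, c as (a, b, c), (b, a, c) and (b, c, a) give totals a = 6, b = 8, c = 4, so the aggregate order is (b, a, c): the consensus favourite wins even though no single list is decisive.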

Chapter 6. Leveraging Content Similarity in Summaries for Generating Better Ensembles

Abstract
Previously, in Chap. 5, we described techniques for effectively aggregating rank lists produced by varying the sentence similarity measure, text representation and ranking algorithm. These belong to a larger family of consensus-based summarisation systems, which democratically select common content from several candidate systems by taking into account the individual rankings of the candidates. In this chapter, we highlight the significant limitations of consensus-based systems that rely only on sentence rankings and not on the actual content of the candidate summaries. Their inability to take into account the relative performance of individual systems, and their disregard of the content of candidate summaries in favour of sentence rankings, limit their performance in several cases. We suggest an alternative approach that can potentially overcome these limitations. We show how, in the absence of gold-standard summaries, the candidates can act as pseudo-relevant summaries to estimate the performance of individual systems. We then use this information to generate a better aggregate. Experiments show that the proposed content-based aggregation system outperforms existing rank-list-based aggregation techniques by a large margin.
Parth Mehta, Prasenjit Majumder
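The pseudo-relevance idea above can be sketched simply: weight each candidate system by how much its summary overlaps with the other candidates' summaries, on the assumption that content most systems agree on is likely to be relevant. The Jaccard overlap used here is our own illustrative choice, not necessarily the measure used in the chapter.

```python
def content_weights(candidate_summaries):
    """Treat the other candidates as pseudo-reference summaries: a system's
    weight is its average word-level Jaccard overlap with the rest."""
    def overlap(a, b):
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / max(len(sa | sb), 1)

    weights = []
    for i, summ in enumerate(candidate_summaries):
        others = [s for j, s in enumerate(candidate_summaries) if j != i]
        weights.append(sum(overlap(summ, o) for o in others) / max(len(others), 1))
    return weights
```

A system whose summary shares little content with the others (an outlier) receives a low weight, so its sentences contribute less to the final aggregate.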

Chapter 7. Neural Model for Sentence Compression

Abstract
The neural sentence extraction model discussed in Chap. 4, as well as the ensemble approaches in Chaps. 5 and 6, all focus solely on choosing the subset of sentences that gives the best ROUGE scores. However, as with extractive techniques in general, these approaches are limited when generating a summary of fixed size. In the absence of reliable generative techniques that can produce new, concise sentences, the next logical step is to eliminate redundant or less informative content from the extracted sentences. The two possible ways to achieve this are sentence compression and sentence simplification. While sentence compression deals solely with removing redundant information, sentence simplification usually involves replacing a difficult phrase or word with a simpler alternative. In the case of legal documents, replacing long legal phrases with more commonly used phrases usually also leads to sentence compression, which improves the precision of fixed-length summaries. We present a new approach to sentence compression for legal documents in which we demonstrate how a phrase-based statistical machine translation system can be modified to generate meaningful sentence compressions, and compare this approach to an LSTM-based sentence compression technique. Next, we show how this problem can be modelled as a sequence-to-sequence mapping problem, which is not limited to deleting words but can also introduce new words in the target sentence.
Parth Mehta, Prasenjit Majumder
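The idea of replacing long legal phrases with shorter, commoner ones can be illustrated at its simplest with a phrase table. The table entries below are hypothetical examples of ours; in the phrase-based SMT approach such mappings would be learned from parallel judgment-summary data rather than hand-written.

```python
# Hypothetical phrase table; a real system learns these from parallel data.
PHRASE_TABLE = {
    "in view of the facts and circumstances of the case": "given the facts",
    "learned counsel appearing on behalf of": "counsel for",
}

def compress(sentence, table=PHRASE_TABLE):
    """Substitution-based compression: rewrite long phrases with their
    shorter equivalents, longest phrases first to avoid partial matches."""
    out = sentence
    for long_phrase, short_phrase in sorted(table.items(), key=lambda x: -len(x[0])):
        out = out.replace(long_phrase, short_phrase)
    return out
```

This deletion/substitution view is only the simplest analogue; the sequence-to-sequence formulation discussed in the chapter generalises it by allowing arbitrary new words in the output.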

Chapter 8. Conclusion

Abstract
In this chapter we present an overview of the work discussed throughout the book and point out some open questions and possible research directions. We proposed several techniques that can improve or complement existing sentence extraction systems. We introduced two new corpora, consisting of legal and scientific articles, that can be used for evaluating sentence compression and abstractive summarisation systems. We then proposed an attention-model-based sentence extraction technique that is capable of identifying key information in documents without requiring any manually labelled data. We showed that such techniques, which use large amounts of pseudo-labelled data, can easily outperform systems that rely on domain knowledge and manual annotations.
Parth Mehta, Prasenjit Majumder

Backmatter
