About this book

This monograph proposes a comprehensive and fully automatic approach to designing text analysis pipelines for arbitrary information needs that are optimal in terms of run-time efficiency and that robustly mine relevant information from text of any kind. Based on state-of-the-art techniques from machine learning and other areas of artificial intelligence, novel pipeline construction and execution algorithms are developed and implemented in prototypical software. Formal analyses of the algorithms and extensive empirical experiments underline that the proposed approach represents an essential step towards the ad-hoc use of text mining in web search and big data analytics.
Both web search and big data analytics aim to fulfill people's needs for information in an ad-hoc manner. The information sought is often hidden in large amounts of natural language text. Instead of simply returning links to potentially relevant texts, leading search and analytics engines have started to directly mine relevant information from the texts. To this end, they execute text analysis pipelines that may consist of several complex information-extraction and text-classification stages. Due to practical requirements of efficiency and robustness, however, the use of text mining has so far been limited to anticipated information needs that can be fulfilled with rather simple, manually constructed pipelines.

Table of Contents

Frontmatter

Chapter 1. Introduction

Abstract
The future of information search is not browsing through tons of web pages or documents. In times of big data and the information overload of the internet, experts in the field agree that both everyday and enterprise search will gradually shift from only retrieving large numbers of texts that potentially contain relevant information to directly mining relevant information in these texts (Etzioni 2011; Kelly and Hamm 2013; Ananiadou et al. 2013). In this chapter, we first motivate the benefit of such large-scale text mining for today's web search and big data analytics applications (Sect. 1.1). Then, we reveal the task specificity and the process complexity of analyzing natural language text as the main problems that prevent applications from performing text mining ad-hoc, i.e., immediately in response to a user query (Sect. 1.2). Section 1.3 points out how we propose to tackle these problems by improving the design, efficiency, and domain robustness of the pipelines of algorithms used for text analysis with artificial intelligence techniques. This leads to the contributions of the book at hand (Sect. 1.4).
Henning Wachsmuth

Chapter 2. Text Analysis Pipelines

Abstract
The understanding of natural language is one of the primary abilities that provide the basis for human intelligence. Since the invention of computers, people have thought about how to operationalize this ability in software applications (Jurafsky and Martin 2009). The rise of the internet in the 1990s then made explicit the practical need for automatically processing natural language in order to access relevant information. Search engines, as a solution, have revolutionized the way we can find such information ad-hoc in large amounts of text (Manning et al. 2008). To this day, however, search engines excel in finding relevant texts rather than in understanding what information is relevant in the texts. Chapter 1 has proposed text mining as a means to achieve progress towards the latter, thereby making information search more intelligent. At the heart of every text mining application lies the analysis of text, mostly realized in the form of text analysis pipelines. In this chapter, we present the basics required to follow the approaches of this book to improve such pipelines for enabling text mining ad-hoc on large amounts of text, as well as the state of the art in this respect.
Text mining combines techniques from information retrieval, natural language processing, and data mining. In Sect. 2.1, we first provide a focused overview of those techniques referred to in this book. Then, we define the text analysis processes and pipelines that we consider in our proposed approaches (Sect. 2.2). We evaluate the different approaches based on texts and pipelines from a number of case studies introduced in Sect. 2.3. Finally, Sect. 2.4 surveys and discusses related existing work in the broad context of ad-hoc large-scale text mining.
Henning Wachsmuth

Chapter 3. Pipeline Design

Abstract
The realization of a text analysis as a sequential execution of the algorithms in a pipeline does not mimic the way humans approach text analysis. Humans simultaneously investigate lexical, syntactic, semantic, and pragmatic clues in and about a text (McCallum 2009) while skimming over the text to quickly focus on the portions of text relevant for a task (Duggan and Payne 2009). From a machine viewpoint, however, the decomposition of a text analysis into single executable steps is a prerequisite for identifying relevant information types and their interdependencies. To this day, this decomposition and the subsequent construction of a text analysis pipeline are mostly done manually, which prevents the use of pipelines for tasks in ad-hoc text mining. Moreover, such pipelines do not focus on the task-relevant portions of input texts, making their execution slower than necessary (cf. Sect. 2.2). In this chapter, we show that both parts of pipeline design (i.e., construction and task-specific execution) can be fully automated, once given adequate formalizations of text analysis.
In Sect. 3.1, we discuss the optimality of text analysis pipelines and we introduce paradigms of an ideal pipeline construction and execution. For automatic construction, we model the expert knowledge underlying text analysis formally (Sect. 3.2). On this basis, we operationalize the cognitive skills of constructing pipelines through partial order planning (Sect. 3.3). In our evaluation, the construction always takes near-zero time, thus enabling ad-hoc text mining. In Sect. 3.4, we then reinterpret text analysis as the task to filter the portions of a text that contain relevant information, i.e., to consistently imitate skimming. We realize this information-oriented view by equipping a pipeline with an input control. Based on the dependencies between relevant information types, the input control determines for each employed algorithm in advance what portions of text its output is relevant for (Sect. 3.5). Such an automatic truth maintenance of the relevant portions results in an optimal pipeline execution, since all unnecessary analyses of input texts are avoided. This not only improves pipeline efficiency significantly in all our experiments, but it also creates the potential for pipeline optimization that we target in Chap. 4. In addition, it implies different ways of trading efficiency for effectiveness, which we examine before (Sect. 3.6).
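The idea of constructing a pipeline from formal knowledge about algorithms can be illustrated with a small sketch. It is not the book's planner: it replaces partial order planning with plain backward chaining over input and output types, and every name in it (`Algorithm`, `construct_pipeline`, the example algorithms and information types) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Algorithm:
    name: str
    inputs: frozenset   # information types the algorithm requires
    outputs: frozenset  # information types the algorithm produces


# Hypothetical algorithm repository (assumption, not taken from the book).
ALGORITHMS = [
    Algorithm("tokenizer", frozenset(), frozenset({"token"})),
    Algorithm("sentence_splitter", frozenset({"token"}), frozenset({"sentence"})),
    Algorithm("pos_tagger", frozenset({"token"}), frozenset({"pos"})),
    Algorithm("time_recognizer", frozenset({"sentence", "token"}), frozenset({"time"})),
    Algorithm("money_recognizer", frozenset({"sentence", "token"}), frozenset({"money"})),
    Algorithm("forecast_classifier", frozenset({"time", "money"}), frozenset({"forecast"})),
]


def construct_pipeline(goal_types, algorithms):
    """Select only the needed algorithms and order them so that every
    input type is produced before the algorithm that requires it.
    Assumes the dependencies between algorithms are acyclic."""
    producers = {}
    for algo in algorithms:
        for info_type in algo.outputs:
            # Simplification: the first producer of a type is used.
            producers.setdefault(info_type, algo)

    pipeline, selected = [], set()

    def require(info_type):
        if info_type not in producers:
            raise ValueError(f"no algorithm produces {info_type!r}")
        algo = producers[info_type]
        if algo.name in selected:
            return
        selected.add(algo.name)
        for needed in sorted(algo.inputs):
            require(needed)
        pipeline.append(algo)  # appended only after all its prerequisites

    for info_type in sorted(goal_types):
        require(info_type)
    return [algo.name for algo in pipeline]


print(construct_pipeline({"forecast"}, ALGORITHMS))
```

In this toy repository, the part-of-speech tagger is left out because nothing required for the goal type depends on it. A real planner would additionally weigh alternative producers of the same type, for example by run-time or effectiveness.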
Henning Wachsmuth

Chapter 4. Pipeline Efficiency

Abstract
The importance of run-time efficiency is still often disregarded in approaches to text analysis tasks, limiting their use for industrial-size text mining applications (Chiticariu et al. 2010b). Search engines avoid efficiency problems by analyzing input texts at indexing time (Cafarella et al. 2005). However, this is impossible in the case of ad-hoc text analysis tasks. In order both to manage and to benefit from the ever-increasing amounts of text in the world, we need not only to scale up existing approaches (Agichtein 2005), but also to develop novel approaches at large scale (Glorot et al. 2011). Standard text analysis pipelines execute computationally expensive algorithms on most parts of the input texts, as we have seen in Sect. 3.1. While one way to enable scalability is to rely only on cheap but less effective algorithms (Pantel et al. 2004; Al-Rfou' and Skiena 2012), in this chapter we present ways to significantly speed up arbitrary pipelines, in some cases by more than one order of magnitude. As a consequence, more effective algorithms can be employed in large-scale text mining.
In particular, we observe that the schedule of a pipeline's algorithms affects the pipeline's efficiency when the pipeline analyzes only relevant portions of text (as achieved by our input control from Sect. 3.5). In Sect. 4.1, we show that the optimal schedule can theoretically be found with dynamic programming. It depends on the run-times of the algorithms and the distribution of relevant information in the input texts. Especially the latter varies strongly between different collections and streams of texts, often making an optimal scheduling too expensive (Sect. 4.2). In practice, we thus perform scheduling with informed search on a sample of texts (Sect. 4.3). In cases where input texts are homogeneous in the distribution of relevant information, the approach reliably finds a near-optimal schedule according to our evaluation. In other cases, there is not one single optimal schedule (Sect. 4.4). To optimize efficiency, a pipeline then needs to adapt to the input text at hand. Under high heterogeneity, such an adaptive scheduling works well by learning in a self-supervised manner what schedule is fastest for which text (Sect. 4.5). For large-scale text mining, a pipeline can finally be parallelized, as we outline in Sect. 4.6. The contribution of Chap. 4 to our overall approach is shown in Fig. 4.1.
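Why the schedule matters once a pipeline analyzes only relevant portions of text can be seen in a toy cost model. The sketch below is a minimal illustration under assumptions that are not from the book: each algorithm has a fixed run-time per portion of text, each passes on a fixed fraction of portions (its selectivity) independently of the others, and there are no ordering constraints between algorithms. Under these assumptions, a dynamic program over subsets of algorithms finds the cheapest order. The function name and the numbers are hypothetical.

```python
from itertools import combinations


def optimal_schedule(run_times, selectivities):
    """Return (expected cost per portion, order) of the cheapest schedule.

    run_times[a]: cost of running algorithm a on one portion of text.
    selectivities[a]: fraction of portions still relevant after a.
    Assumes independent selectivities and no dependencies (toy model).
    """
    names = sorted(run_times)
    # best[S] = cheapest way to run exactly the algorithms in S first.
    best = {frozenset(): (0.0, [])}
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            chosen = frozenset(subset)
            candidates = []
            for last in subset:
                rest = chosen - {last}
                cost, order = best[rest]
                # Fraction of text that survives all algorithms in `rest`.
                remaining = 1.0
                for name in rest:
                    remaining *= selectivities[name]
                candidates.append((cost + run_times[last] * remaining, order + [last]))
            best[chosen] = min(candidates)
    return best[frozenset(names)]


# Hypothetical run-times and selectivities for three algorithms.
run_times = {"time": 1.0, "money": 2.0, "forecast": 10.0}
selectivities = {"time": 0.5, "money": 0.2, "forecast": 0.5}
cost, order = optimal_schedule(run_times, selectivities)
print(order, cost)
```

With these numbers, running the cheap and selective algorithms first means the expensive one sees only a tenth of the text. Because the selectivities depend on the distribution of relevant information in the input, a different collection of texts can yield a different best order, which is the situation the adaptive scheduling of Sect. 4.5 addresses.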
Henning Wachsmuth

Chapter 5. Pipeline Robustness

Abstract
The ultimate purpose of text analysis pipelines is to infer new information from unknown input texts. To this end, the algorithms employed in pipelines are usually developed on known training texts from the anticipated domains of application (cf. Sect. 2.1). In many applications, however, the unknown texts significantly differ from the known texts, because a consideration of all possible domains within the development is practically infeasible (Blitzer et al. 2007). As a consequence, algorithms often fail to infer information effectively, especially when they rely on features of texts that are specific to the training domain. Such missing domain robustness constitutes a fundamental problem of text analysis (Turmo et al. 2006; Daumé and Marcu 2006). The missing robustness of an algorithm directly reduces the robustness of a pipeline it is employed in. This in turn limits the benefit of pipelines in all search engines and big data analytics applications, where the domains of texts cannot be anticipated. In this chapter, we present first substantial results of an approach that improves robustness by relying on novel structure-based features that are invariant across domains.
Section 5.1 discusses how to achieve ideal domain independence in theory. Since the domain robustness problem is very diverse, we then focus on a specific type of text analysis tasks (unlike in Chaps. 3 and 4). In particular, we consider tasks that deal with the classification of argumentative texts, like sentiment analysis, stance recognition, or automatic essay grading (cf. Sect. 2.1). In Sect. 5.2, we introduce a shallow model of such tasks, which captures the sequential overall structure of argumentative texts on the pragmatic level while abstracting from their content. For instance, we observe that review argumentation can be represented by the flow of local sentiment. Given the model, we demonstrate that common flow patterns exist in argumentative texts (Sect. 5.3). Our hypothesis is that such patterns generalize well across domains. In Sect. 5.4, we learn common flow patterns with a supervised variant of clustering. Then, we use each pattern as a single feature for classifying argumentative texts from different domains. Our results for sentiment analysis indicate the robustness of modeling overall structure (other tasks are left for future work). In addition, we can visually make results more intelligible based on the model (Sect. 5.5). Altogether, this chapter realizes the overall analysis within the approach of this book, highlighted in Fig. 5.1. Both robustness and intelligibility benefit the use of pipelines in ad-hoc large-scale text mining.
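How a flow of local sentiment might be turned into one feature per pattern can be sketched as follows. The abstract does not specify the flow representation, the clustering variant, or the similarity measure, so everything here is an assumption made for illustration: sentiment values lie in [-1, 1], flows are brought to a common length by linear interpolation, the patterns are given by hand rather than learned, and each feature is the mean absolute distance to a pattern.

```python
def normalize_flow(flow, length=5):
    """Linearly interpolate a sentiment flow to a fixed length (>= 2)."""
    if not flow:
        raise ValueError("flow must contain at least one value")
    if len(flow) == 1:
        return [float(flow[0])] * length
    result = []
    for i in range(length):
        # Position of the i-th output value on the original flow's index axis.
        pos = i * (len(flow) - 1) / (length - 1)
        low = int(pos)
        high = min(low + 1, len(flow) - 1)
        frac = pos - low
        result.append(flow[low] * (1 - frac) + flow[high] * frac)
    return result


def pattern_features(flow, patterns):
    """One feature per pattern: mean absolute distance of the flow to it."""
    normalized = normalize_flow(flow, len(patterns[0]))
    return [
        sum(abs(a - b) for a, b in zip(normalized, pattern)) / len(pattern)
        for pattern in patterns
    ]


# Hypothetical patterns: "positive start, negative end" and the reverse.
PATTERNS = [
    [1.0, 1.0, 1.0, 0.0, -1.0],
    [-1.0, -1.0, 0.0, 1.0, 1.0],
]

# A three-sentence review: positive, positive, negative.
print(normalize_flow([1, 1, -1]))
print(pattern_features([1, 1, -1], PATTERNS))
```

Because the features refer only to the shape of the flow and not to any words, they carry no domain-specific vocabulary, which is the intuition behind the hypothesis that such patterns generalize across domains.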
Henning Wachsmuth

Chapter 6. Conclusion

Abstract
The ability to perform text mining ad-hoc at large scale has the potential to substantially improve the way people find information today in terms of speed and quality, both in everyday web search and in big data analytics. More complex information needs can be fulfilled immediately, and previously hidden information can be accessed. At the heart of every text mining application, relevant information is inferred from natural language texts by a text analysis process. Mostly, such a process is realized in the form of a pipeline that sequentially executes a number of information extraction, text classification, and other natural language processing algorithms. As a matter of fact, text mining is studied in the field of computational linguistics, which we consider from a computer science perspective in this book.
Besides the fundamental challenge of inferring relevant information effectively, we have revealed the automatic design of a text analysis pipeline and the optimization of a pipeline's run-time efficiency and domain robustness as major requirements for enabling ad-hoc large-scale text mining. Then, we have investigated the research question of how to exploit knowledge about a text analysis process and information obtained within the process to approach these requirements. To this end, we have developed different models and algorithms that can be employed to address information needs ad-hoc on large numbers of texts. The algorithms rely on classical and statistical techniques from artificial intelligence, namely, planning, truth maintenance, and informed search as well as supervised and self-supervised learning. All algorithms have been analyzed formally, implemented as software, and evaluated experimentally.
In Sect. 6.1, we summarize our main findings and their contributions to different areas of computational linguistics. We outline that they have both scientific and practical impact on the state of the art in text mining. However, far from every problem of ad-hoc large-scale text mining has been solved or even approached at all in this book. In the words of Alan Turing, we can therefore already see plenty there that needs to be done in the given and in new directions of future research (Sect. 6.2). Also, some of our main ideas may be beneficial for other problems from computer science or even from other fields of application, as we finally sketch at the end.
Henning Wachsmuth

Backmatter
