2015 | Original Paper | Book Chapter

2. Text Analysis Pipelines

Author: Henning Wachsmuth

Published in: Text Analysis Pipelines

Publisher: Springer International Publishing

Abstract

The understanding of natural language is one of the primary abilities that provide the basis for human intelligence. Since the invention of computers, people have thought about how to operationalize this ability in software applications (Jurafsky and Martin 2009). The rise of the internet in the 1990s then made explicit the practical need for automatically processing natural language in order to access relevant information. Search engines, as a solution, have revolutionized the way we can find such information ad-hoc in large amounts of text (Manning et al. 2008). To this day, however, search engines excel at finding relevant texts rather than at understanding what information in the texts is relevant. Chapter 1 has proposed text mining as a means to achieve progress towards the latter, thereby making information search more intelligent. At the heart of every text mining application lies the analysis of text, mostly realized in the form of text analysis pipelines. In this chapter, we present the basics required to follow the approaches of this book for improving such pipelines to enable ad-hoc text mining on large amounts of text, as well as the state of the art in this respect.
Text mining combines techniques from information retrieval, natural language processing, and data mining. In Sect. 2.1, we first provide a focused overview of those techniques referred to in this book. Then, we define the text analysis processes and pipelines that we consider in our proposed approaches (Sect. 2.2). We evaluate the different approaches based on texts and pipelines from a number of case studies introduced in Sect. 2.3. Finally, Sect. 2.4 surveys and discusses related existing work in the broad context of ad-hoc large-scale text mining.

Footnotes
1
Notice that, throughout this book, we assume that the reader has a roughly graduate-level background in computer science or a related field.
 
2
Ananiadou and McNaught (2005) refer to the second step as information extraction. While we agree that information extraction is often the most important part of this step, other techniques from natural language processing also play a role, as discussed later in this section.
 
3
Unlike us, some researchers do not distinguish between sentiment analysis and opinion mining but use the two terms interchangeably (Pang and Lee 2008).
 
4
Besides the references cited below, parts of the summary are inspired by the Coursera machine learning course, https://www.coursera.org/course/ml (accessed on June 15, 2015).
 
5
A discussion of common quality measures follows at the end of this section.
 
6
The question for which text analysis tasks a rule-based approach should be preferred over a machine learning approach lies outside the scope of this book.
 
7
Throughout this book, we consider only features whose values come from a metric scale. Other features are transformed, e.g. a feature with values “red”, “green”, and “blue” can be represented by three 0/1-features, one for each value. All values are normalized to the same interval, namely [0,1], which benefits learning (Witten and Frank 2005).
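
For illustration, a minimal sketch of the two transformations, with made-up feature values and no particular library assumed:

def one_hot(value, categories):
    """Represent a categorical value by one 0/1-feature per category."""
    return [1.0 if value == category else 0.0 for category in categories]

def min_max_normalize(values):
    """Rescale metric feature values to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant feature: map all values to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(one_hot("green", ["red", "green", "blue"]))   # [0.0, 1.0, 0.0]
print(min_max_normalize([3.0, 10.0, 7.0]))          # [0.0, 1.0, 0.571...]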
 
8
The concrete features of a feature type can often be chosen automatically based on input data, as we do in our experiments, e.g. by taking only those words whose occurrence is above some threshold. Thereby, useless features that would introduce noise are excluded.
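
As an illustration, a small sketch of such a data-driven selection for word features; the threshold value and the whitespace tokenization are simplifying assumptions:

from collections import Counter

def select_word_features(texts, min_count=5):
    """Keep only those words as features whose occurrence count in the
    input texts reaches the (hypothetical) threshold."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return sorted(word for word, count in counts.items() if count >= min_count)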
 
9
Techniques like feature selection and dimensionality reduction, which aim to reduce the set of considered features to improve generalizability and training efficiency among others (Hastie et al. 2009), are beyond the scope of this book.
 
10
Some existing text analysis algorithms that we employ rely on other classification algorithms, though, such as decision trees or artificial neural networks (Witten and Frank 2005).
 
11
Besides effectiveness and efficiency, we also investigate the robustness and intelligibility of text analysis in Chap. 5. Further details are given there.
 
12
The development of statistical approaches benefits from a balanced dataset (see above). This can be achieved through either undersampling majority classes or oversampling minority classes. Where needed, we mostly perform the latter using random duplicates.
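
A minimal sketch of such random oversampling; the data layout (parallel lists of instances and class labels) is a simplifying assumption:

import random

def oversample(instances, labels, seed=42):
    """Balance a dataset by randomly duplicating instances of minority
    classes until every class matches the majority-class size."""
    by_class = {}
    for instance, label in zip(instances, labels):
        by_class.setdefault(label, []).append(instance)
    target = max(len(members) for members in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for label, members in by_class.items():
        duplicates = [rng.choice(members) for _ in range(target - len(members))]
        balanced.extend((instance, label) for instance in members + duplicates)
    rng.shuffle(balanced)
    return [i for i, _ in balanced], [l for _, l in balanced]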
 
13
Many text corpora already provide an according corpus split, including most of those that we use in our experiments (cf. Appendix C).
 
14
In some of our efficiency experiments, no parameter optimization takes place. We leave out the use of a validation set in these cases, as pointed out where relevant.
 
15
Some exceptions to this assumption exist, of course. For instance, authorship attribution (see above) can be expected to often be hard for humans.
 
16
A simple example is the interpretation of periods in tokenization and sentence splitting: Knowing sentence boundaries simplifies the determination of tokens with periods like abbreviations, but knowing the abbreviations also helps to determine sentence boundaries.
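
To make the interdependence concrete, a toy sketch in which a known abbreviation list keeps a naive sentence splitter from breaking at every period; the abbreviation list and the example text are made up:

ABBREVIATIONS = {"e.g.", "i.e.", "Dr.", "etc."}   # made-up, incomplete list

def split_sentences(text):
    """Naive splitter: a period ends a sentence unless the token carrying
    it is a known abbreviation. Real splitters are far more elaborate."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith arrived late. He apologized."))
# ['Dr. Smith arrived late.', 'He apologized.']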
 
17
Some related work, such as Shen et al. (2007), speaks of workflows rather than pipelines. The term workflow is more general, also covering cascades where the input can take different paths. Indeed, such cascades are important in text analysis, e.g. when the sequence of algorithms to be executed depends on the language of the input text. From an execution viewpoint, however, we can see each taken path as a single pipeline in such cases.
 
18
While named differently, the way we represent pipelines and the algorithms they compose here largely conforms to their realization in standard software frameworks for text analysis, like Apache UIMA, http://uima.apache.org, accessed on June 15, 2015.
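
Independent of any particular framework, a generic sketch of this representation: a pipeline as an ordered sequence of analysis algorithms that successively annotate a shared document (all class and method names are hypothetical):

class Algorithm:
    """A single text analysis algorithm; subclasses add annotations."""
    def process(self, document):
        raise NotImplementedError

class Pipeline:
    """A text analysis pipeline: algorithms executed in a fixed order."""
    def __init__(self, algorithms):
        self.algorithms = list(algorithms)

    def run(self, document):
        for algorithm in self.algorithms:
            algorithm.process(document)   # later steps may read earlier annotations
        return document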
 
19
Iterative pipelines are to a certain extent related to compiler pipelines that include feedback loops (Buschmann et al. 1996). There, results from later compiler stages (say, semantic analysis) are used to resolve ambiguities in earlier stages (say, lexical analysis).
 
21
Accordingly, we do not discuss infrastructural technologies for distributed computing, such as Apache Hadoop, http://hadoop.apache.org, accessed on June 15, 2015.
 
Metadata
Title
Text Analysis Pipelines
Author
Henning Wachsmuth
Copyright year
2015
DOI
https://doi.org/10.1007/978-3-319-25741-9_2
