2015 | Original Paper | Book Chapter

2. Text Analysis Pipelines

Author: Henning Wachsmuth

Published in: Text Analysis Pipelines

Publisher: Springer International Publishing

Abstract

The understanding of natural language is one of the primary abilities that provide the basis for human intelligence. Since the invention of computers, people have thought about how to operationalize this ability in software applications (Jurafsky and Martin 2009). The rise of the internet in the 1990s then made explicit the practical need for automatically processing natural language in order to access relevant information. Search engines, as a solution, have revolutionized the way we can find such information ad-hoc in large amounts of text (Manning et al. 2008). To this day, however, search engines excel at finding relevant texts rather than at understanding what information in the texts is relevant. Chapter 1 has proposed text mining as a means to achieve progress towards the latter, thereby making information search more intelligent. At the heart of every text mining application lies the analysis of text, mostly realized in the form of text analysis pipelines. In this chapter, we present the basics required to follow the approaches of this book for improving such pipelines to enable ad-hoc text mining on large amounts of text, as well as the state of the art in this respect.
Text mining combines techniques from information retrieval, natural language processing, and data mining. In Sect. 2.1, we first provide a focused overview of those techniques referred to in this book. Then, we define the text analysis processes and pipelines that we consider in our proposed approaches (Sect. 2.2). We evaluate the different approaches based on texts and pipelines from a number of case studies introduced in Sect. 2.3. Finally, Sect. 2.4 surveys and discusses related existing work in the broad context of ad-hoc large-scale text mining.

Footnotes
1
Notice that, throughout this book, we assume that the reader has a roughly graduate-level background in computer science or a related field.
 
2
Ananiadou and McNaught (2005) refer to the second step as information extraction. While we agree that information extraction is often the most important part of this step, other techniques from natural language processing also play a role, as discussed later in this section.
 
3
Unlike us, some researchers do not distinguish between sentiment analysis and opinion mining but use the two terms interchangeably (Pang and Lee 2008).
 
4
Besides the references cited below, parts of the summary are inspired by the Coursera machine learning course, https://www.coursera.org/course/ml (accessed on June 15, 2015).
 
5
A discussion of common quality measures follows at the end of this section.
 
6
The question for which text analysis tasks a rule-based approach should be preferred over a machine learning approach lies outside the scope of this book.
 
7
Throughout this book, we consider only features whose values come from a metric scale. Other features are transformed, e.g. a feature with values “red”, “green”, and “blue” can be represented by three 0/1-features, one for each value. All values are normalized to the same interval, namely [0,1], which benefits learning (Witten and Frank 2005).
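
For illustration, a minimal sketch of the two transformations, with made-up feature values and no particular library assumed:

def one_hot(value, categories):
    """Represent a categorical value by one 0/1-feature per category."""
    return [1.0 if value == category else 0.0 for category in categories]

def min_max_normalize(values):
    """Rescale metric feature values to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant feature: map all values to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(one_hot("green", ["red", "green", "blue"]))   # [0.0, 1.0, 0.0]
print(min_max_normalize([3.0, 10.0, 7.0]))          # [0.0, 1.0, 0.571...]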
 
8
The concrete features of a feature type can often be chosen automatically based on input data, as we do in our experiments, e.g. by taking only those words whose occurrence is above some threshold. Thereby, useless features that would introduce noise are excluded.
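
As an illustration, a small sketch of such a data-driven selection for word features; the threshold value and the whitespace tokenization are simplifying assumptions:

from collections import Counter

def select_word_features(texts, min_count=5):
    """Keep only those words as features whose occurrence count in the
    input texts reaches the (hypothetical) threshold."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return sorted(word for word, count in counts.items() if count >= min_count)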
 
9
Techniques like feature selection and dimensionality reduction, which aim to reduce the set of considered features to improve generalizability and training efficiency among others (Hastie et al. 2009), are beyond the scope of this book.
 
10
Some existing text analysis algorithms that we employ rely on other classification algorithms, though, such as decision trees or artificial neural networks (Witten and Frank 2005).
 
11
Besides effectiveness and efficiency, we also investigate the robustness and intelligibility of text analysis in Chap. 5. Further details are given there.
 
12
The development of statistical approaches benefits from a balanced dataset (see above). This can be achieved through either undersampling majority classes or oversampling minority classes. Where needed, we mostly perform the latter using random duplicates.
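
A minimal sketch of such random oversampling; the data layout (parallel lists of instances and class labels) is a simplifying assumption:

import random

def oversample(instances, labels, seed=42):
    """Balance a dataset by randomly duplicating instances of minority
    classes until every class matches the majority-class size."""
    by_class = {}
    for instance, label in zip(instances, labels):
        by_class.setdefault(label, []).append(instance)
    target = max(len(members) for members in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for label, members in by_class.items():
        duplicates = [rng.choice(members) for _ in range(target - len(members))]
        balanced.extend((instance, label) for instance in members + duplicates)
    rng.shuffle(balanced)
    return [i for i, _ in balanced], [l for _, l in balanced]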
 
13
Many text corpora already provide an according corpus split, including most of those that we use in our experiments (cf. Appendix C).
 
14
In some of our efficiency experiments, no parameter optimization takes place. We leave out the use of a validation set in these cases, as pointed out where relevant.
 
15
Some exceptions to this assumption exist, of course. For instance, authorship attribution (see above) can be expected to often be hard for humans.
 
16
A simple example is the interpretation of periods in tokenization and sentence splitting: Knowing sentence boundaries simplifies the determination of tokens with periods like abbreviations, but knowing the abbreviations also helps to determine sentence boundaries.
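
To make the interdependence concrete, a toy sketch in which a known abbreviation list keeps a naive sentence splitter from breaking at every period; the abbreviation list and the example text are made up:

ABBREVIATIONS = {"e.g.", "i.e.", "Dr.", "etc."}   # made-up, incomplete list

def split_sentences(text):
    """Naive splitter: a period ends a sentence unless the token carrying
    it is a known abbreviation. Real splitters are far more elaborate."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith arrived late. He apologized."))
# ['Dr. Smith arrived late.', 'He apologized.']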
 
17
Some related work, such as Shen et al. (2007), speaks of workflows rather than pipelines. The term workflow is more general, also covering cascades where the input can take different paths. Indeed, such cascades are important in text analysis, e.g. when the sequence of algorithms to be executed depends on the language of the input text. From an execution viewpoint, however, we can see each taken path as a single pipeline in such cases.
 
18
While named differently, the way we represent pipelines and the algorithms they compose here largely conforms to their realization in standard software frameworks for text analysis, like Apache UIMA, http://uima.apache.org, accessed on June 15, 2015.
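
Independent of any particular framework, a generic sketch of this representation: a pipeline as an ordered sequence of analysis algorithms that successively annotate a shared document (all class and method names are hypothetical):

class Algorithm:
    """A single text analysis algorithm; subclasses add annotations."""
    def process(self, document):
        raise NotImplementedError

class Pipeline:
    """A text analysis pipeline: algorithms executed in a fixed order."""
    def __init__(self, algorithms):
        self.algorithms = list(algorithms)

    def run(self, document):
        for algorithm in self.algorithms:
            algorithm.process(document)   # later steps may read earlier annotations
        return document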
 
19
Iterative pipelines are to a certain extent related to compiler pipelines that include feedback loops (Buschmann et al. 1996). There, results from later compiler stages (say, semantic analysis) are used to resolve ambiguities in earlier stages (say, lexical analysis).
 
21
Accordingly, we do not discuss infrastructural technologies for distributed computing, such as Apache Hadoop, http://hadoop.apache.org, accessed on June 15, 2015.
 
Metadata
Title
Text Analysis Pipelines
Author
Henning Wachsmuth
Copyright year
2015
DOI
https://doi.org/10.1007/978-3-319-25741-9_2
