
2015 | Original Paper | Book Chapter

3. Pipeline Design

Written by: Henning Wachsmuth

Published in: Text Analysis Pipelines

Publisher: Springer International Publishing


Abstract

Realizing a text analysis process as a sequential execution of the algorithms in a pipeline does not mimic the way humans approach text analysis tasks. Humans simultaneously investigate lexical, syntactic, semantic, and pragmatic clues in and about a text (McCallum 2009) while skimming over the text to quickly focus on the portions of text relevant for a task (Duggan and Payne 2009). From a machine viewpoint, however, the decomposition of a text analysis process into single executable steps is a prerequisite for identifying relevant information types and their interdependencies. To date, this decomposition and the subsequent construction of a text analysis pipeline are mostly done manually, which prevents the use of pipelines for tasks in ad-hoc text mining. Moreover, such pipelines do not focus on the task-relevant portions of input texts, making their execution slower than necessary (cf. Sect. 2.2). In this chapter, we show that both parts of pipeline design (i.e., construction and task-specific execution) can be fully automated, given adequate formalizations of text analysis.
In Sect. 3.1, we discuss the optimality of text analysis pipelines and introduce paradigms of an ideal pipeline construction and execution. For automatic construction, we formally model the expert knowledge underlying text analysis processes (Sect. 3.2). On this basis, we operationalize the cognitive skills of constructing pipelines through partial order planning (Sect. 3.3). In our evaluation, the construction always takes near-zero time, thus enabling ad-hoc text mining. In Sect. 3.4, we then reinterpret text analysis as the task of filtering the portions of a text that contain relevant information, i.e., of consistently imitating skimming. We realize this information-oriented view by equipping a pipeline with an input control. Based on the dependencies between relevant information types, the input control determines in advance, for each employed algorithm, the portions of text for which its output is relevant (Sect. 3.5). Such automatic truth maintenance of the relevant portions results in an optimal pipeline execution, since all unnecessary analyses of input texts are avoided. This not only improves pipeline efficiency significantly in all our experiments, but also creates the efficiency potential of pipeline scheduling that we target in Chap. 4. In addition, it implies different ways of trading efficiency for effectiveness, which we examine beforehand (Sect. 3.6).
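The filtering idea described above can be illustrated with a minimal sketch. It is not the chapter's input control: it assumes, for illustration only, that portions of text are sentences, that the task needs all sought information types within the same sentence, and that each algorithm is a simple predicate. The names `has_money`, `has_year`, and `run_with_filtering` are hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a text analysis algorithm: it inspects one
# portion of text and reports whether it found its information type there.
Algorithm = Callable[[str], bool]


def has_money(sentence: str) -> bool:
    """Toy detector for a monetary expression (assumption: a '$' sign)."""
    return "$" in sentence


def has_year(sentence: str) -> bool:
    """Toy detector for a year (assumption: any four-digit token)."""
    tokens = (tok.strip(".,") for tok in sentence.split())
    return any(len(tok) == 4 and tok.isdigit() for tok in tokens)


def run_with_filtering(
    portions: List[str], pipeline: List[Algorithm]
) -> Tuple[List[str], List[int]]:
    """Run each algorithm only on the portions that are still relevant.

    Returns the portions that passed every algorithm, plus the number of
    portions each algorithm actually had to analyze.
    """
    relevant = list(portions)
    calls = []
    for algorithm in pipeline:
        calls.append(len(relevant))
        # A portion stays relevant only if this algorithm found its type.
        relevant = [p for p in relevant if algorithm(p)]
    return relevant, calls


sentences = [
    "Revenue rose to $5 million in 2014.",
    "The weather was nice.",
    "Costs were $2 million.",
]
relevant, calls = run_with_filtering(sentences, [has_money, has_year])
print(relevant)
print(calls)
```

In this toy run, the second algorithm analyzes only two of the three sentences, because the first one has already ruled out the sentence without a monetary expression. The amount of text each algorithm sees therefore depends on the order of the algorithms, which is the kind of scheduling potential the abstract attributes to Chap. 4.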


Footnotes
1
The limitation to pipelines with only one algorithm for each information type could be dropped by extending the definition of admissibility, which we omit here for simplicity. Admissibility would then require that an algorithm \(A_i \in \mathbf{A}\) with required input information \(\mathbf{C}_i^{(in)}\) is not scheduled before any algorithm \(A_j\) for which \(\mathbf{C}_j^{(out)} \cap \mathbf{C}_i^{(in)} \neq \emptyset\) holds.
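The extended admissibility condition can be checked mechanically. The following is a minimal sketch, not the chapter's formalization: the `Algorithm` class, the function `is_admissible`, and the example algorithms and type names are hypothetical and serve only to illustrate the condition on input and output sets.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Algorithm:
    """Hypothetical model of an algorithm A_i with C_i^(in) and C_i^(out)."""
    name: str
    inputs: FrozenSet[str]
    outputs: FrozenSet[str]


def is_admissible(schedule: List[Algorithm]) -> bool:
    """Check that no algorithm A_i is scheduled before an algorithm A_j
    whose outputs overlap with the required inputs of A_i."""
    for i, a_i in enumerate(schedule):
        for a_j in schedule[i + 1:]:
            # A_j comes later but produces something A_i requires.
            if a_j.outputs & a_i.inputs:
                return False
    return True


# Illustrative algorithms (names and types are assumptions).
tokenizer = Algorithm("tokenizer", frozenset(), frozenset({"Token"}))
tagger = Algorithm("tagger", frozenset({"Token"}), frozenset({"PartOfSpeech"}))

print(is_admissible([tokenizer, tagger]))
print(is_admissible([tagger, tokenizer]))
```

The check is quadratic in the length of the schedule, which is unproblematic for pipelines with a handful of algorithms. It only tests the ordering condition and does not verify that every required input is produced by some algorithm in the schedule.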
 
2
The second subproblem of the pipeline optimization problem was originally presented in the context of the theory on optimal scheduling in Wachsmuth and Stein (2012).
 
3
The given steps revise the pipeline method from Wachsmuth et al. (2011). There, we named the last step “optimized scheduling”. We call it “optimal scheduling” here, since we discuss the theory behind pipelines rather than a practical approach. The difference between optimal and optimized is detailed in Chap. 4.
 
4
The given definition of filtering revises the definition of (Wachsmuth et al. 2011) where we used the term to denote the partial resulting from early .
 
5
Equation 3.4 assumes that the run-time of each filtering \(A^{(F)} \in \mathbf{A}^*\) is zero. In Sect. 3.5, we offer evidence that the time required for filtering is in fact almost negligible.
 
6
In Wachsmuth et al. (2011), we report on a small loss for \(\mathbf A _2\), which we there assume to emanate from noise of algorithms that operate on token . Meanwhile, we have found out that the actual reason was an implementation error, which is now fixed.
 
8
The notion of type and the modeled structures are in line with the software framework Apache UIMA, http://uima.apache.org, accessed on June 15, 2015.
 
9
We refer to features of annotations in this chapter only. They should not be confused with the machine learning features (cf. Chap. 2.1), which play a role in Chaps. 4 and 5.
 
10
Because of the changed definition of annotation task in Sect. 3.1, the definition of planning also slightly differs from the one in Wachsmuth et al. (2013a).
 
11
We assume that run-time estimates of all algorithms in \(\mathbf{A}\) are given. If in doubt, at least some default value can be used for each algorithm without a run-time estimate.
 
12
U-Compare version 2.0, http://nactem.ac.uk/ucompare, accessed on June 8, 2015.
 
13
The term “entity” in particular may be counterintuitive for types that are not core information extraction types, e.g., for Sentence. In the end, however, the output of all pipelines is structured information that can be used in databases. Hence, it serves to fill the (entity) slots of a (relation) template in the language of the classical MUC tasks (Chinchor et al. 1993). Such templates represent the table schemes of databases.
 
14
In Wachsmuth et al. (2013c), we associate the relevance of portions of text and, hence, also the assignment of degrees of filtering with relation types instead of conjunctions. The resort to conjunctions can be seen as a generalization, because it allows us to determine the relevance of a portion of text also with respect to an atomic entity type only.
 
15
There is no clear connection between the specified degrees of filtering and the precision of a text analysis. In many applications, however, a higher precision will often be easier to achieve if text analysis is performed only on small portions of text.
 
16
For complex relation like , the degree of of an inner conjunction may exceed the degree of an outer conjunction. For instance, in the example from Fig. 3.14, foundation (outer) might be sentence-wise, while (inner) could be resolved based on complete paragraphs. In such a case, with respect to the outer conjunction affects the to be resolved, but not the to be used for resolution.
 
17
We do not distinguish between knowledge and information in this section, in order to emphasize the connection between text analysis and truth maintenance in artificial intelligence.
 
18
Unlike here, for space reasons we do not determine the relevant within updateScopes in Wachsmuth et al. (2013c), which requires to store the externally instead.
 
19
Here, we analyze the for the case that both the algorithms employed in a pipeline and the of these algorithms have been defined. In contrast, the examples at the beginning of Sect. 3.4 have suggested that the amount of text to be analyzed (and, hence, the ) may depend on the . The problem of finding the optimal under the given filtering is discussed in Chap. 4.
 
20
For the proof, it does not matter whether the instances of an information in \(\mathbf C ^{(out)}\) are used to generate , since no has taken place yet in this case and, hence, the whole text \(D \) can be after lines 1–3 of updateScopes.
 
21
The set \(\mathbf{C}^{(out)}\) can be inferred from the so-called result specification of an analysis engine, which is derived automatically from the analysis engine’s descriptor file.
 
22
We provide no comparison to existing approaches, as these approaches do not compete with our approach, but rather can be integrated with it (cf. Sect. 3.6).
 
23
An exact evaluation of and is hardly feasible on the input texts, since the relation sought for are not annotated. Moreover, the given evaluation of is only fairly representative: In practice, many extractors do not look for cross-sentence and cross-paragraph at all. In such cases, remains unaffected by .
 
24
In Table 3.5, the number of characters for Paragraph is higher than for Text (20.02 M as opposed to 19.14 M), which seems counterintuitive. The reason is that the degree of filtering Paragraph requires an additional application of the algorithm spa. A corresponding non-filtering pipeline for the paragraph level actually processes 22.97 million characters.
 
25
As shown in Fig. 3.22(a), the is already close to its maximum when clf is integrated after sto \(_2\), i.e., when are available, such as , bigrams, etc. So, more complex are not really needed in the end, which indicates that the of language is comparably easy on the given input texts.
 
26
While the extraction remains unaffected from the position of integration in the experiment, this is primarily due to the lack of false in the LFA-11 corpus only.
 
Metadata
Title
Pipeline Design
Written by
Henning Wachsmuth
Copyright year
2015
DOI
https://doi.org/10.1007/978-3-319-25741-9_3
