Published in: Datenbank-Spektrum 1/2019

10-12-2018 | Focus Article

A Big Data Case Study in Digital Humanities

Creating a Performance Benchmark for Canonical Text Services

Authors: Gerhard Heyer, Jochen Tiepmar

Abstract

While the volume of primary data in the text-oriented humanities is small compared to the terabytes that are now standard in Big Data applications, the secondary data that result from scholarly annotation require a fine-grained reference model for primary data that is based on hierarchical structure. This paper proposes a reusable performance benchmark for Canonical Text Services (CTS), a service for accessing and retrieving text content and structural meta information for hierarchically structured texts, and shows how the benchmark can be used to evaluate the technical performance of such a system.
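
To make the hierarchical reference model more concrete, the following minimal Java sketch splits a CTS URN into its components, using the example URN quoted in the footnotes. It assumes the usual colon-separated layout (prefix, protocol, namespace, work, passage) with dots separating the passage levels. The class and method names are hypothetical and are not taken from the authors' implementation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: names are hypothetical, not the authors' code.
final class CtsUrnSketch {

    // Splits a CTS URN into its colon-separated components.
    static List<String> components(String urn) {
        return Arrays.asList(urn.split(":"));
    }

    // Returns the hierarchical passage levels, e.g. "2.1.2" -> [2, 1, 2].
    // Assumption: the passage is the fifth colon-separated component.
    static List<String> passageLevels(String urn) {
        List<String> parts = components(urn);
        if (parts.size() < 5) {
            // The URN addresses a whole work and has no passage part.
            return Collections.emptyList();
        }
        return Arrays.asList(parts.get(4).split("\\."));
    }

    public static void main(String[] args) {
        String urn = "urn:cts:pbc:bible.parallel.eng.kingjames:2.1.2";
        System.out.println(components(urn));
        System.out.println(passageLevels(urn));
    }
}
```

Each additional passage level narrows the reference, which is what allows annotations to point at text units of different granularity.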

Footnotes
1
The result from MySQL’s AUTO_INCREMENT is not necessarily gap-free.
 
2
SELECT urn WHERE urn LIKE BINARY "urn:cts:pbc:bible.parallel.eng.kingjames:2.1.%".
 
3
The dot (.) and the colon (:).
 
4
SELECT urn WHERE urn LIKE "[URN]%" AND urn LIKE BINARY "[URN]%".
 
5
The text column has been indexed due to the implementation of the full-text search described in [14], but this additional index is not required for the CTS index.
 
6
Java was chosen because of its widespread support and its straightforward use in web applications (Servlets).
 
7
The CTS URN urn:cts:pbc:bible.parallel.eng.kingjames:2.1.2 is 46 characters long.
 
8
A B-tree that is processed in such a way that its input and output are the same as those of a trie.
 
9
Including the speed of the network itself, possible internal proxy redirects, additional server traffic, and even the performance that a specific browser software provides.
 
10
See https://developer.ted.com/.
 
11
The incorrect document-level meta information in the TED subtitle transcripts cannot be repaired because the API is closed. This is not a problem for a performance benchmark because its purpose is not to validate the content.
 
12
pbc: 657,936; dta: 16,438,119; ted: 15,292,408.
 
13
dta2, dta3, ted2, dta1, ted3, dta4, ted1, ted4, pbc.
 
14
For example, Sentence 1 in Chap. 2 to Sentence 3 in Chap. 4.
 
15
Java’s default StringTokenizer is used, with space, tab, newline, carriage return, and form feed as hardcoded delimiters. If these delimiters do not apply to a specific language, the sub-passage is the non-tokenized text, which is also a valid request.
 
16
AuthenticAMD Common KVM Processor.
 
17
Debian 8.5, kernel 3.16.7-ckt25-2 (2016-04-08), x86_64, codename Jessie.
 
18
AMD Opteron 6234.
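
The URN prefix query in footnote 2 and the tokenization in footnote 15 can be illustrated with a short Java sketch. It is an in-memory analogue written for illustration, not the authors' implementation: the class and method names are hypothetical, and `String.startsWith` stands in for the case-sensitive `LIKE BINARY` prefix match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative sketch only: names are hypothetical, not the authors' code.
final class FootnoteSketch {

    // Analogue of the query in footnote 2: keep the URNs that start,
    // case-sensitively, with the parent URN followed by a dot.
    static List<String> childUrns(List<String> urns, String parentUrn) {
        String prefix = parentUrn + ".";
        List<String> result = new ArrayList<>();
        for (String urn : urns) {
            if (urn.startsWith(prefix)) {
                result.add(urn);
            }
        }
        return result;
    }

    // Footnote 15: tokenize with StringTokenizer's default delimiters
    // (space, tab, newline, carriage return, form feed).
    static List<String> tokens(String text) {
        StringTokenizer tokenizer = new StringTokenizer(text);
        List<String> result = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            result.add(tokenizer.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        String work = "urn:cts:pbc:bible.parallel.eng.kingjames:";
        List<String> urns = List.of(work + "2.1.1", work + "2.1.2",
                                    work + "2.10.1", work + "2.2.1");
        System.out.println(childUrns(urns, work + "2.1"));
        System.out.println(tokens("In the beginning\tGod created"));
    }
}
```

Appending the dot to the prefix keeps 2.10.1 from being returned as a child of 2.1. A text that contains none of the default delimiters comes back as a single token, which corresponds to the non-tokenized fallback described in footnote 15.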
 
Literature
1.
Blackwell C, Roughan C, Smith DN (2017) Citation and alignment: scholarship outside and inside the codex. Manuscript Studies, vol. 1
3.
Cormen TH, Leiserson CE, Rivest RL, Stein C (2001) Introduction to Algorithms, 2nd edn. MIT Press, Cambridge, Massachusetts
4.
Fielding RT (2000) Architectural styles and the design of network-based software architectures. University of California, Irvine (PhD thesis)
5.
Geyken A, Haaf S, Jurish B, Schulz M, Steinmann J, Thomas C, Wiegand F (2011) Das Deutsche Textarchiv: Vom historischen Korpus zum aktiven Archiv. In: Digitale Wissenschaft Stand und Entwicklung digital vernetzter Forschung in Deutschland, vol. 2
7.
Mayer T, Cysouw M (2014) Creating a massively parallel Bible corpus. In: Proceedings of LREC
9.
Nah FF-H (2004) A study on tolerable waiting time: how long are Web users willing to wait? Behav Inf Technol 23(3):153–163
10.
Schneider R (2012) Evaluating DBMS-based access strategies to very large multi-layer corpora. In: Proceedings of the LREC-2012 Workshop on Challenges in the Management of Large Corpora, Istanbul
13.
Tiepmar J (2018) Implementation and evaluation of the canonical text service protocol as part of a research infrastructure in the digital humanities. Leipzig University, Leipzig (PhD thesis)
14.
Tiepmar J, Heyer G (2017) An overview of canonical text services. Linguistics and Literature Studies
15.
Tiepmar J, Teichmann C, Heyer G, Berti M, Crane G (2013) A new implementation for Canonical Text Services. In: Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)
16.
Williamson DF, Parker RA, Kendrick JS (1989) The box plot: a simple visual method to interpret data. Ann Intern Med 110:916–921
Metadata
Title
A Big Data Case Study in Digital Humanities
Creating a Performance Benchmark for Canonical Text Services
Authors
Gerhard Heyer
Jochen Tiepmar
Publication date
10-12-2018
Publisher
Springer Berlin Heidelberg
Published in
Datenbank-Spektrum / Issue 1/2019
Print ISSN: 1618-2162
Electronic ISSN: 1610-1995
DOI
https://doi.org/10.1007/s13222-018-00302-7
