2018 | Original Paper | Book Chapter

51. Reproducible Research in Document Analysis and Recognition

Authors: Jorge Ramón Fonseca Cacho, Kazem Taghva

Published in: Information Technology - New Generations

Publisher: Springer International Publishing

Abstract

With reproducible research becoming a de facto standard in computational sciences, many approaches have been explored to enable researchers in other disciplines to adopt this standard. In this paper, we explore the importance of reproducible research in the field of document analysis and recognition and in Computer Science as a whole. First, we report on the difficulties one can face when trying to reproduce research under current publication standards. For a large percentage of research, these difficulties may include missing raw or original data, the lack of a tidied-up version of the data, unavailable source code, or the lack of the software needed to run the experiment. Furthermore, even when all of these are available, we found that replicating the research was not a trivial task because of missing documentation and deprecated dependencies. We offer a solution to these reproducibility issues by using container technologies such as Docker. As an example, we revisit the installation and execution of OCRSpell, which we reported on and implemented in 1994. Although the code for OCRSpell is freely available on GitHub, we continually receive emails from individuals who have difficulty compiling and using it on modern hardware platforms. We walk through the development of an OCRSpell Docker container: creating an image, uploading that image, and enabling others to run the program by simply downloading the image and running the container.
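The container workflow outlined in the abstract (create an image, upload it, then download and run it) can be sketched roughly as follows. This is a minimal illustration, not the authors' actual setup: the image name `example-user/ocrspell:latest` is hypothetical, and the sketch assumes a Dockerfile exists in the build directory. It only assembles the standard Docker CLI commands and prints them, so nothing is executed.

```python
import shlex

# Hypothetical image name (assumption); the chapter's real repository
# name is not stated in the abstract.
IMAGE = "example-user/ocrspell:latest"


def docker_workflow(image: str, context_dir: str = ".") -> dict[str, list[str]]:
    """Return the Docker CLI command for each workflow step as an argument list."""
    return {
        # Author side: build an image from the Dockerfile found in context_dir.
        "build": ["docker", "build", "-t", image, context_dir],
        # Author side: upload the image to a registry such as Docker Hub.
        "push": ["docker", "push", image],
        # User side: download the published image.
        "pull": ["docker", "pull", image],
        # User side: start an interactive container that is removed on exit.
        "run": ["docker", "run", "--rm", "-it", image],
    }


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    for step, argv in docker_workflow(IMAGE).items():
        print(f"{step}: {shlex.join(argv)}")
```

To actually run a step, each argument list could be passed to `subprocess.run(argv, check=True)` on a machine with Docker installed. Keeping the commands as argument lists avoids shell-quoting problems.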

Metadata
Title
Reproducible Research in Document Analysis and Recognition
Authors
Jorge Ramón Fonseca Cacho
Kazem Taghva
Copyright year
2018
DOI
https://doi.org/10.1007/978-3-319-77028-4_51
