2018 | OriginalPaper | Chapter

51. Reproducible Research in Document Analysis and Recognition

Authors: Jorge Ramón Fonseca Cacho, Kazem Taghva

Published in: Information Technology - New Generations

Publisher: Springer International Publishing

Abstract

With reproducible research becoming a de facto standard in the computational sciences, many approaches have been explored to enable researchers in other disciplines to adopt this standard. In this paper, we explore the importance of reproducible research in the field of document analysis and recognition and in computer science as a whole. First, we report on the difficulties that one can face when trying to reproduce research under current publication standards. For a large percentage of research, these difficulties may include missing raw or original data, the lack of a tidied-up version of the data, unavailable source code, or the lack of the software needed to run the experiment. Furthermore, even when all of these were available, we found that replicating the research was not a trivial task because of missing documentation and deprecated dependencies. We offer a solution to these reproducibility issues that uses container technologies such as Docker. As an example, we revisit the installation and execution of OCRSpell, which we implemented and reported on in 1994. While the code for OCRSpell is freely available on GitHub, we continually receive emails from individuals who have difficulty compiling and using it on modern hardware platforms. We walk through the development of an OCRSpell Docker container: creating an image, uploading that image, and enabling others to run the program simply by downloading the image and running the container.
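
The container workflow named in the abstract (create an image, upload it, then download and run it) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' actual setup: the image name `example-user/ocrspell:latest` is hypothetical, a Dockerfile is assumed to exist in the build directory, and the sketch only assembles the standard `docker build`, `docker push`, `docker pull`, and `docker run` command lines without executing them.

```python
import shlex

# Hypothetical image name; the chapter's actual repository and tag are not stated here.
IMAGE = "example-user/ocrspell:latest"


def docker_workflow(image, context_dir="."):
    """Return the docker CLI commands for each role as argv lists (nothing is executed)."""
    return {
        # Author side: build an image from the Dockerfile in context_dir, then upload it.
        "author": [
            ["docker", "build", "-t", image, context_dir],
            ["docker", "push", image],
        ],
        # User side: download the image, then start an interactive container
        # that is removed again when it exits (--rm).
        "user": [
            ["docker", "pull", image],
            ["docker", "run", "--rm", "-it", image],
        ],
    }


if __name__ == "__main__":
    for role, commands in docker_workflow(IMAGE).items():
        for argv in commands:
            print(f"[{role}] {shlex.join(argv)}")
```

Keeping the commands as data rather than running them makes the author/user split explicit: only the first two steps require the source code and build dependencies, while a user needs nothing beyond Docker itself and the published image name.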

Metadata
Title
Reproducible Research in Document Analysis and Recognition
Authors
Jorge Ramón Fonseca Cacho
Kazem Taghva
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-77028-4_51
