
2015 | OriginalPaper | Chapter

Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Authors: Myriam C. Traub, Jacco van Ossenbruggen, Lynda Hardman

Published in: Research and Advanced Technology for Digital Libraries

Publisher: Springer International Publishing

Abstract

Humanities scholars increasingly rely on digital archives for their research instead of time-consuming visits to physical archives. This shift in research method carries the hidden cost of working with digitally processed historical documents: how much trust can a scholar place in noisy representations of source texts? In a series of interviews with historians about their use of digital archives, we found that scholars are aware that optical character recognition (OCR) errors may bias their results. They were, however, unable to quantify this bias or to indicate what information they would need to estimate it. Such an estimate is important for assessing whether the results are publishable. Based on the interviews and a literature study, we provide a classification of scholarly research tasks that accounts for their susceptibility to specific OCR-induced biases and for the data required for uncertainty estimations. We conducted a use case study on a national newspaper archive with example research tasks. From this we learned what data is typically available in digital archives and how it could be used to reduce and/or assess the uncertainty in result sets. We conclude that the current state of knowledge on the users’ side as well as on the tool makers’ and data providers’ side is insufficient and needs to be improved.

Metadata

Title: Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Authors: Myriam C. Traub, Jacco van Ossenbruggen, Lynda Hardman

Copyright Year: 2015

DOI: https://doi.org/10.1007/978-3-319-24592-8_19
