2020 | Original Paper | Book Chapter

Assessing the Impact of OCR Errors in Information Retrieval

Written by: Guilherme Torresan Bazzo, Gustavo Acauan Lorentz, Danny Suarez Vargas, Viviane P. Moreira

Published in: Advances in Information Retrieval

Publisher: Springer International Publishing

Abstract

A significant amount of the textual content available on the Web is stored in PDF files. These files are usually converted into plain text before they can be processed by information retrieval or text mining systems. Automatic conversion often introduces errors, especially when OCR is needed. In this empirical study, we simulate OCR errors and investigate the impact that misspelled words have on retrieval accuracy. To quantify this impact, errors were systematically inserted at varying rates into an initially clean IR collection. Our results show that the impact becomes significant at error rates of 5% and above. Furthermore, stemming makes systems more robust to errors.
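
The abstract does not spell out how errors were inserted, so the following is only a minimal sketch of the general idea, under stated assumptions: the error rate is taken to be the fraction of words that receive one character-level edit, and the `CONFUSIONS` table and the names `corrupt_word` and `inject_errors` are hypothetical, not the authors' error model.

```python
import random

LETTERS = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical look-alike substitutions; an assumption, not taken from the paper.
CONFUSIONS = {"l": "1", "o": "0", "e": "c", "m": "rn", "i": "l", "s": "5"}


def corrupt_word(word, rng):
    """Apply one random character-level edit so the word always changes."""
    if not word:
        return word
    pos = rng.randrange(len(word))
    op = rng.choice(["substitute", "delete", "insert"])
    if op == "substitute":
        ch = word[pos].lower()
        # Use a look-alike if one is listed, otherwise any other letter.
        repl = CONFUSIONS.get(ch) or rng.choice(LETTERS.replace(ch, ""))
        return word[:pos] + repl + word[pos + 1:]
    if op == "delete" and len(word) > 1:
        return word[:pos] + word[pos + 1:]
    # Insert a random letter (also the fallback for one-character words).
    return word[:pos] + rng.choice(LETTERS) + word[pos:]


def inject_errors(text, error_rate, seed=0):
    """Corrupt round(error_rate * number_of_words) randomly chosen words."""
    rng = random.Random(seed)
    words = text.split()
    n_errors = round(len(words) * error_rate)
    targets = set(rng.sample(range(len(words)), n_errors))
    return " ".join(
        corrupt_word(w, rng) if i in targets else w
        for i, w in enumerate(words)
    )


if __name__ == "__main__":
    clean = "information retrieval systems index plain text extracted from files " * 5
    for rate in (0.0, 0.05, 0.10, 0.25):
        print(rate, inject_errors(clean, rate, seed=42))
```

Running the corrupted collection through the same indexing and query pipeline at each rate, with and without stemming, would then allow retrieval accuracy to be compared against the clean baseline.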

Metadata
Title
Assessing the Impact of OCR Errors in Information Retrieval
Written by
Guilherme Torresan Bazzo
Gustavo Acauan Lorentz
Danny Suarez Vargas
Viviane P. Moreira
Copyright year
2020
DOI
https://doi.org/10.1007/978-3-030-45442-5_13
