Published in: Neural Computing and Applications, Issue 23/2020

Open Access | 9 May 2020 | S.I.: Emerging applications of Deep Learning and Spiking ANN

Building an efficient OCR system for historical documents with little training data

Authors: Jiří Martínek, Ladislav Lenc, Pavel Král

Abstract

As the number of digitized historical documents has increased rapidly over the last few decades, efficient methods of information retrieval and knowledge extraction are needed to make the data accessible. Such methods depend on optical character recognition (OCR), which converts document images into textual representations. Current OCR methods are often not adapted to the historical domain; moreover, they usually need a significant amount of annotated documents. Therefore, this paper introduces a set of methods that allows performing OCR on historical document images using only a small amount of real, manually annotated training data. The presented complete OCR system covers two main tasks: page layout analysis, including text block and line segmentation, and OCR itself. Our segmentation methods are based on fully convolutional networks, and the OCR approach utilizes recurrent neural networks. Both approaches are state of the art in their respective fields. We have created a novel real dataset for OCR from the Porta fontium portal. This corpus is freely available for research, and all proposed methods are evaluated on these data. We show that both the segmentation and OCR tasks are feasible with only a few annotated real data samples. The experiments aim to determine the best way to achieve good performance with the given small set of data. We also demonstrate that the obtained scores are comparable to or even better than those of several state-of-the-art systems. To sum up, this paper shows how to create an efficient OCR system for historical documents that needs only a small amount of annotated training data.
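
To make the two-stage structure in the abstract (segmentation first, recognition second) a little more concrete, the sketch below shows one simple way a per-pixel binary text mask, of the kind a fully convolutional network might output, could be split into text-line row ranges before each line is passed to a recognizer. The function name `split_lines`, the `min_pixels` parameter, and the row-projection heuristic are all assumptions made for illustration; the abstract does not describe the authors' actual post-processing.

```python
from typing import List, Tuple


def split_lines(mask: List[List[int]], min_pixels: int = 1) -> List[Tuple[int, int]]:
    """Split a binary text mask into (top, bottom) row ranges, bottom exclusive.

    Hypothetical helper: a row counts as text if it contains at least
    `min_pixels` foreground pixels; consecutive text rows form one line.
    """
    lines: List[Tuple[int, int]] = []
    start = None  # first row of the line currently being collected
    for y, row in enumerate(mask):
        is_text = sum(row) >= min_pixels
        if is_text and start is None:
            start = y                      # a new line begins here
        elif not is_text and start is not None:
            lines.append((start, y))       # the current line ended above row y
            start = None
    if start is not None:                  # the mask ends inside a line
        lines.append((start, len(mask)))
    return lines


# Toy 5x4 mask: rows 1-2 form one line, row 4 forms another.
mask = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
print(split_lines(mask))  # [(1, 3), (4, 5)]
```

A projection profile like this only works for roughly horizontal, well-separated lines; skewed or touching lines in real historical scans would need a more robust method.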

Metadata
Title
Building an efficient OCR system for historical documents with little training data
Authors
Jiří Martínek
Ladislav Lenc
Pavel Král
Publication date
9 May 2020
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 23/2020
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-020-04910-x
