
Neural Computing and Applications


Open Access | 9 May 2020 | S.I.: Emerging applications of Deep Learning and Spiking ANN

Building an efficient OCR system for historical documents with little training data

Authors: Jiří Martínek, Ladislav Lenc, Pavel Král

Published in: Neural Computing and Applications | Issue 23/2020


Abstract

As the number of digitized historical documents has increased rapidly over the last few decades, efficient methods of information retrieval and knowledge extraction are needed to make the data accessible. Such methods depend on optical character recognition (OCR), which converts document images into textual representations. Current OCR methods are often not adapted to the historical domain; moreover, they usually need a significant amount of annotated documents. Therefore, this paper introduces a set of methods that allows OCR to be performed on historical document images using only a small amount of real, manually annotated training data. The presented complete OCR system covers two main tasks: page layout analysis, including text block and line segmentation, and OCR itself. Our segmentation methods are based on fully convolutional networks, and the OCR approach utilizes recurrent neural networks. Both approaches are state of the art in their respective fields. We have created a novel real dataset for OCR from the Porta fontium portal. This corpus is freely available for research, and all proposed methods are evaluated on these data. We show that both the segmentation and OCR tasks are feasible with only a few annotated real data samples. The experiments aim to determine the best way to achieve good performance with the given small set of data. We also demonstrate that the obtained scores are comparable to or even better than the scores of several state-of-the-art systems. To sum up, this paper shows how to create an efficient OCR system for historical documents that needs only a small amount of annotated training data.
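The abstract compares OCR scores without naming the metric. As an illustration only, the sketch below computes the character error rate (CER), a measure commonly used to score OCR output, as the edit distance between a reference transcription and the recognized text divided by the reference length. The choice of CER and the function names `levenshtein` and `character_error_rate` are assumptions made for this example and are not taken from the paper.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn reference into hypothesis."""
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete ref_char
                current[j - 1] + 1,      # insert hyp_char
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

A lower value is better, and 0.0 means the recognized line matches the reference exactly. Because insertions count as errors, the value can exceed 1.0 when the OCR output is much longer than the reference.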


http://www.portafontium.eu/.

http://ocr-corpus.kiv.zcu.cz/.



Title: Building an efficient OCR system for historical documents with little training data
Authors: Jiří Martínek, Ladislav Lenc, Pavel Král
Publication date: 9 May 2020
Publisher: Springer London
Published in: Neural Computing and Applications / Issue 23/2020
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI: https://doi.org/10.1007/s00521-020-04910-x
