
2021 | Original Paper | Book Chapter

Scalable Handwritten Text Recognition System for Lexicographic Sources of Under-Resourced Languages and Alphabets

Authors: Jan Idziak, Artjoms Šeļa, Michał Woźniak, Albert Leśniak, Joanna Byszuk, Maciej Eder

Published in: Computational Science – ICCS 2021

Publisher: Springer International Publishing


Abstract

This paper presents an approach to deciphering large collections of handwritten index cards from historical dictionaries. Our study provides a working solution that reads the cards and links their lemmas to a searchable list of dictionary entries for a large historical dictionary, the Dictionary of the 17th- and 18th-century Polish, which comprises 2.8 million index cards. We apply a tailored handwritten text recognition (HTR) solution that involves (1) an optimized detection model; (2) a recognition model to decipher the handwritten content, designed as a spatial transformer network (STN) followed by a convolutional neural network (RCNN) with a connectionist temporal classification (CTC) layer, trained on a synthetic set of 500,000 generated Polish words of different lengths; and (3) a post-processing step using constrained Word Beam Search (WBC), in which the predictions are matched against a list of dictionary entries known in advance. Our model achieved a word-level accuracy of 0.881, which outperforms the base RCNN model. Within this study we also produced a set of 20,000 manually annotated index cards that can be used for future benchmarks and for transfer learning in HTR applications.
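To make the last two stages of the pipeline more concrete, the following minimal Python sketch shows how per-frame CTC output can be collapsed into a word, how that word can be constrained to a known list of entries, and how word-level accuracy is computed. It is an illustration under stated assumptions, not the authors' implementation: it uses greedy best-path decoding and a nearest-entry lookup by edit distance as a much simpler stand-in for Word Beam Search, and the blank index, the toy alphabet, and all function names are hypothetical.

```python
from typing import Sequence

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_ids: Sequence[int], alphabet: str) -> str:
    """Best-path CTC decoding: collapse repeated labels, then drop blanks."""
    chars = []
    prev = BLANK
    for idx in frame_ids:
        if idx != prev and idx != BLANK:
            chars.append(alphabet[idx - 1])  # labels 1..N map to alphabet[0..N-1]
        prev = idx
    return "".join(chars)


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        row = [i]
        for j, cb in enumerate(b, start=1):
            row.append(min(prev_row[j] + 1,                 # deletion
                           row[j - 1] + 1,                  # insertion
                           prev_row[j - 1] + (ca != cb)))   # substitution
        prev_row = row
    return prev_row[-1]


def snap_to_lexicon(word: str, lexicon: Sequence[str]) -> str:
    """Return the lexicon entry closest to `word`; ties break alphabetically.

    This is NOT Word Beam Search: it corrects the decoded string after the
    fact instead of constraining the beam during decoding.
    """
    return min(lexicon, key=lambda entry: (levenshtein(word, entry), entry))


def word_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that match their reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


if __name__ == "__main__":
    alphabet = "kot"                      # toy alphabet: 1='k', 2='o', 3='t'
    frames = [1, 1, 0, 2, 2, 0, 3]        # hypothetical per-frame argmax labels
    decoded = ctc_greedy_decode(frames, alphabet)
    matched = snap_to_lexicon(decoded, ["kot", "dom", "las"])
    print(decoded, matched)
```

A decoder that applies the lexicon constraint during the search, as Word Beam Search does, can recover words that greedy decoding followed by a lookup would miss, so this sketch should be read only as a description of the data flow.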


Metadata
Title
Scalable Handwritten Text Recognition System for Lexicographic Sources of Under-Resourced Languages and Alphabets
Authors
Jan Idziak
Artjoms Šeļa
Michał Woźniak
Albert Leśniak
Joanna Byszuk
Maciej Eder
Copyright Year
2021
DOI
https://doi.org/10.1007/978-3-030-77961-0_13